2019 S&P 500 Data Exploration Stage

by Marc Angelo Acebedo

Table of Contents

I) Introduction

  • I kept features in separate CSVs because date formats differ. eps_fc and eps_act

After cleaning the original dataset as documented in my data wrangling process, data_cleaning.ipynb, I isolated the following columns:

features.csv

  • firm_id : foreign key referring to primary keys in firms.csv
  • feature : type of feature that the value field denotes (eps_fc, eps_act, eod_act, eps_fc_terms)
  • date : DateTime object in YYYY-MM-DD format
    • for eod_act, means the date at which the value was recorded
    • for eps_fc_terms, means the date at which the term forecast was made
  • term : period object in YYYYQQ format (the time period when the value was recorded)
    • FISCAL years for eps_fc and eps_act
    • CALENDAR years for eod_act and eps_fc_terms
  • value : displays the EPS or EOD value as specified in the 'feature' column.

avgs.csv

  • firm_id : foreign key referring to the primary keys in firms.csv
  • average : recorded average value as specified in the average_type column
  • average_type : type of average denoted (twenty year, quarterly, or yearly)
  • time_period : time period that the average is recorded:
    • for twenty-year averages, it's NaN
    • for yearly averages, displays the year (YYYY)
    • for quarterly averages, displays the quarter (QQ)
  • feature : type of feature recorded for the average (eps_fc, eps_act, eod_act, eps_fc_terms)

II) Data Setup & Overview

In [1]:
#import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sb
import statsmodels.api as sm
import math 
import random
import calendar

from matplotlib import cm

%matplotlib inline
plt.style.use('bmh')
In [2]:
#define data directories
PATH_CLEAN = './data/clean/'
PATH_CLEAN_AVGS = './data/clean/averages/'
In [3]:
#define visuals destination
PATH_UNIVARIATE = './visuals/univariate/'
PATH_BIVARIATE = './visuals/bivariate/'
PATH_MULTIVARIATE = './visuals/multivariate/'

Import features and all averages

In [4]:
features = pd.read_csv(PATH_CLEAN + 'features.csv', low_memory = False)
avgs = pd.read_csv(PATH_CLEAN_AVGS + 'avgs.csv')
In [5]:
#import firm_ids for foreign key references
firm_ids = pd.read_csv(PATH_CLEAN + 'firms.csv')

Describe Datasets

In [6]:
#look at 5 random entries
features.sample(5)
Out[6]:
firm_id feature date term value
144311 227 eps_act NaN 1999Q4 NaN
83470 7 eps_fc NaN 2014Q3 0.5980
108296 303 eps_fc NaN 2005Q1 NaN
8686 103 eod_act 2007-09-28 2007Q3 51.2800
18435 219 eod_act 2008-12-31 2008Q4 3.1875
In [7]:
avgs.sample(5)
Out[7]:
firm_id average average_type time_period feature
20900 195 NaN yearly 2015 eps_act
23803 68 52.817600 yearly 2000 eod_act
2052 32 NaN yearly 1999 eps_fc
42174 259 0.646000 yearly 2016 eps_fc_terms
48651 171 70.076548 quarterly q2 eod_act
In [8]:
print('FEATURES rows, columns = {}'.format(features.shape), '\n',
      'AVERAGES rows, columns = {}'.format(avgs.shape))
FEATURES rows, columns = (167660, 5) 
 AVERAGES rows, columns = (52015, 5)

Convert DateTime columns

In [9]:
features.dtypes
Out[9]:
firm_id      int64
feature     object
date        object
term        object
value      float64
dtype: object
In [10]:
features.date = pd.to_datetime(features.date)
features.term = pd.to_datetime(features.term).dt.to_period('Q')
In [11]:
#verify dtypes
features.dtypes
Out[11]:
firm_id             int64
feature            object
date       datetime64[ns]
term        period[Q-DEC]
value             float64
dtype: object
In [12]:
avgs.dtypes
Out[12]:
firm_id           int64
average         float64
average_type     object
time_period      object
feature          object
dtype: object
In [13]:
features.sample(5)
Out[13]:
firm_id feature date term value
127890 31 eps_act NaT 2010Q3 0.090
148594 278 eps_act NaT 1999Q3 NaN
137053 140 eps_act NaT 2012Q2 0.640
118801 428 eps_fc NaT 2006Q2 NaN
61862 243 eps_fc_terms 2000-04-01 2000Q3 0.233

III) Preliminary Exploration

Structure of the Datasets

Our features.csv dataset contains 167,660 entries with 5 features. Firm ID, feature, date, and term are all categorical variables while the value field is numeric and continuous. Even though the date and term fields are recorded as DateTime and Period objects respectively, they are still discrete, categorical data, because there is a limit to the year that can be recorded (1999 - 2020) and there cannot be more than 4 quarters (Q1 - Q5).

Our avgs.csv dataset contains 52,015 entries with 5 features. Firm ID, average type, time period, and feature are all categorical variables while the average field is numeric and continuous.

Main features(s) of interest in the dataset

I'm interested in seeing the historical correlation of forecasted vs. actual EPS across all firms in the 2019 S&P Index.

  • eps_fc, eps_act are the main variables of interest—they are consistent in measuring both the forecasted and actual EPS of all 505 firms, per fiscal period, over a span of 20 years.
  • eod_act is recorded based on calendar period instead of fiscal period, which is a flaw in the data. However, it can be used for some further exploration.
  • eps_fc_terms is based on calendar period. Although it depicts forecasted EPS, this is still not a main variable of interest because it can extend my main research question, but not fully answer it.

As somebody with very little familiarity with the stock market, I decided to dedicate this thesis project to my exploration of a field I am not familiar with.. I read the book "A Beginner's Guide to the Stock Market" by Matthew R. Kratter, which piqued my interest in stock market trading—particularly dividend stocks. I decided that if I were going to educate myself further on the stock market, then this thesis project to end my senior year at NCF would be a perfect opportunity to directly explore this new interest. Not only would I be educating myself on how the stock market works, but I would also be working directly with stock market data, which will help me build further intuition in future stock market and finance-related projects.

These are the questions I'd like to pose in my exploratory stage:

  1. Does average EPS prediction error depict any differences in trends whether by a yearly, quarterly, or full-term basis?

  2. I generate “dumb EPS forecasts” by calculating the rolling mean of actual EPS from the past 2 quarters. How do my forecasted EPS forecasts compare to Bloomberg forecasts?

  3. What differences and similarities emerge when analyzing the prediction error and percentage error of EPS forecasts?

  4. Does statistical significance stay true among the relationships between actual EPS and forecasted EPS, regardless of forecasted type (raw data along with all yearly, quarterly, and twenty-year averages)?

  5. How do EOD prices trend across all firms from 1999 - 2019?

Features in the dataset that will support my investigation into the features of interest

For the broadest overview, I predict that overtime, EPS forecasts will continually become optimistic for those firms that consistently have high actual EPS values. Vice versa, overtime, EPS forecasts will continually become pessimistic for those firms that consistently have lower-than-expected EPS values. As for the other factors, I expect yearly values to show more consistency in pattern (since it is more intuitive to measure 20 years) than quarterly values (since economic situations are greatly diverse over the period of 20 years, no matter the period).

IV) Exploration

A) Univariate

MISSING VALUES (Features)

In [14]:
sb.set(style = "darkgrid")
In [15]:
def generate_missing_total(df, title_name, save_path, csv_name):
    plt.figure(figsize = [10, 5])
    plt.title('Missing Values per Column under ' + title_name, size = 20)
    na_counts = df.isna().sum().sort_values(ascending = True)
    
    na_counts.plot.barh(x = na_counts.values, y = na_counts.index);
    plt.xlabel('Count', size = 10)
    plt.ylabel('Column Name', size = 10)
    plt.savefig(save_path + csv_name)
In [16]:
features.isna().any()
Out[16]:
firm_id    False
feature    False
date        True
term       False
value       True
dtype: bool
In [17]:
generate_missing_total(features, '"Features"', PATH_UNIVARIATE, 'features-missing-total.png')

Observation 1: date is the field with the highest amount of missing data, with around 85,000 missing entries.

Observation 2: value contains around 26,000 missing entries.*

Observation 3: The gap in number of missing values between value and date is noticeably large.

  • This makes sense because date is automatically set to 'NaT' for eps_fc and eps_act.

Questions

1) For all the entries where date is not NaT, how large is the gap in missing values between values without the dropped dates and values with the dropped dates?

In [18]:
features_date_dropna = features[features['date'].notna()]
In [19]:
plt.figure(figsize = [10, 5])
plt.suptitle('Missing Values per Column under "Features"', size = 20)
plt.title('after dropping empty dates', size = 15)
na_counts = features_date_dropna.isna().sum().sort_values(ascending = True)

fig = na_counts.plot.barh(x = na_counts.values, y = na_counts.index);
plt.constrained_layout = True
plt.xlabel('Count', size = 10)
plt.ylabel('Column Name', size = 10)
plt.savefig(PATH_UNIVARIATE + 'features-missing-date-dropna.png')

Question 1: For all the entries where date is not NaT, how large is the gap in missing values between values without the dropped dates and values with the dropped dates?

Answer

Observation 1: Keeping NaT dates, the amount of overall missing values is around 26,000. After dropping all the NaT dates (dropping eod_actand eps_fc_terms), the amount of missing values dropped to around 14,000.

Observation 2: Drawing from the previous observation, this means that eps_fc and eps_act both have around 12,000 missing values in total.

Observation 3: Effectively, the undropped columns, eod_act and eps_fc_terms, have around 14,000 missing values in total.


Questions:

2) How many missing values does each feature have, individually?

In [20]:
#turn feature values into index values
features_num_nan = features[['feature', 'value']].groupby('feature').count()
In [21]:
#count number of total values (NaN or not) per feature
feature_counts = features.feature.value_counts()
In [22]:
#create function to calculate number NaN
def calculate_nan(num_nan, counts, feature_name):
    num_nan.loc[feature_name] = counts[feature_name] - num_nan.loc[feature_name]
    #num NaN (feature) = num all values (NaN or not) - num all values (not NaN)
    
#create array of feature names
feature_names = ['eps_fc', 'eps_act', 'eod_act', 'eps_fc_terms']
In [23]:
#update all nan counts per average type
for feature_name in feature_names:
    calculate_nan(features_num_nan, feature_counts, feature_name)
In [24]:
plt.figure(figsize = [10, 5])
sb.set(style = 'whitegrid')
plt.title('Missing Values per Feature under "Features"', size = 20)
plt.xlabel('Feature')
plt.ylabel('Count')

sb.barplot(x = features_num_nan.index, y = features_num_nan.value, color = "pink")
plt.savefig(PATH_UNIVARIATE + 'features-missing-per-feature.png')

Observation 1: The feature with the most amount of missing values is eps_fc_terms.

Observation 2: The feature with the least amount of missing values is eps_act.

Observation 3: eod_act, eps_fc, and eps_fc_terms all have a very small gap in missing data comparedto eps_act

Question 2: How many missing values does each feature have, individually?

Answer:

Observation 1: Below is a list of each feature and their corresponding estimated number of missing values:

  • eod_act : 7900
  • eps_act : 5500
  • eps_fc : 7050
  • eps_fc_terms : 7100

Observation 2: eps_fc and eps_act have around 12,000 missing values total, which is consistent with our previous observation.

Observation 3: eod_act and eps_fc_terms have around 14,000 missing values total, which is also consistent with our previous observation.

MISSING VALUES (Averages)

In [25]:
avgs.isna().any()
Out[25]:
firm_id         False
average          True
average_type    False
time_period      True
feature         False
dtype: bool
In [26]:
#averages
generate_missing_total(avgs, 'Averages', PATH_UNIVARIATE, 'avgs-missing-total.png')

Observation 1: The average column contains 6,000 missing data entries.

Observation 2: The time_period column contains exactly 2,000 missing data entries.

Observation 3: The gap in missing values between average and time_period is 4,000.

Observation 3: By default, all entries with average_type of twenty_year should have NaT fields for time_period.


Questions:

3) After dropping all NaT fields under time_period, how large will the gap in missing values be between values with the NaT dates and values after dropping the NaT dates?

4) How many missing average values does each average_type contain?

In [27]:
avgs_date_dropna = avgs[avgs['time_period'].notna()]
In [28]:
plt.figure(figsize = [10, 5])
plt.suptitle('Missing Values per Column under "Averages"', size = 20)
plt.title('after dropping empty time periods', size = 15)
na_counts = avgs_date_dropna.isna().sum().sort_values(ascending = True)

fig = na_counts.plot.barh(x = na_counts.values, y = na_counts.index);
plt.constrained_layout = True
plt.xlabel('Count', size = 10)
plt.ylabel('Column Name', size = 10)
plt.savefig(PATH_UNIVARIATE + 'avgs-missing-date-dropna.png')

Question 3: After dropping all NaT fields under time_period, how large will the gap in missing values be between values with the NaT dates and values after dropping the NaT dates?

Answer:

Observation 1: After dropping all NaT dates under time_period, there are still 6,000 missing averages. This is number is consistent with the number of missing averages even before dropping NaT dates. Thus, all entries with a missing time period (aka all twenty_year average types) did not contain any empty data.

To summarize the above, the number of missing average values remains the same whether or not you drop all entries with empty time periods, aka all twenty_year average types.

In [29]:
#turn avg values into index values
avgs_num_nan = avgs[['feature', 'average']].groupby('feature').count()
In [30]:
#count number of total values (NaN or not) per feature
avgs_counts = avgs.feature.value_counts()
In [31]:
#update number of missing values for each average type
for feature_name in feature_names:
    calculate_nan(avgs_num_nan, avgs_counts, feature_name)
In [32]:
avgs_num_nan
Out[32]:
average
feature
eod_act 1539
eps_act 1255
eps_fc 1655
eps_fc_terms 1562
In [33]:
plt.figure(figsize = [10, 5])
# sb.set(style = 'whitegrid')
plt.title('Missing Values per Feature under "Averages"', size = 20)
plt.xlabel('Feature')
plt.ylabel('Count')

sb.barplot(x = avgs_num_nan.index, y = avgs_num_nan.average, color = "pink")
plt.savefig(PATH_UNIVARIATE + 'avgs-missing-per-feature.png')

Observation 1: eps_fc has the most missing values at around 1650.

Observation 2: eps_act has the least amount of missing values at around 1250.

Observation 3: Under "Averages", the gap in missing values between each average type is larger than the missing values under "Features".

Answer:

Observation 1: Each average type has the following approximate number of missing averages:

  • eod_act : 1550
  • eps_act : 1250
  • eps_fc : 1650
  • eps_fc_terms : 1590
In [34]:
#turn avg values into index values
avgs_num_avg = avgs[['average_type', 'average']].groupby('average_type').count()
In [35]:
plt.figure(figsize = [10, 5])
# sb.set(style = 'whitegrid')
plt.title('Missing Values per Average Type under "Averages"', size = 20)
plt.xlabel('Average Type')
plt.ylabel('Count')

sb.barplot(x = avgs_num_avg.index, y = avgs_num_avg.average, color = "pink")
plt.savefig(PATH_UNIVARIATE + 'avgs-missing-per-avg-type.png')

Question 4: How many missing average values does each average_type contain?

Answer:

Observation 1:

  • quarterly: ~7500
  • twenty_year: ~2500
  • yearly: ~3600

Observation 2: The gap in missing values is greatly diversified across all average types.

FIRM_ID (Features)

Since 505 firms is too much to fit into one single visual, I decided to split them apart by focusing on the 20 most common firm ids and the 20 rarest firm ids.

In [36]:
def generate_pct_bar(df, cat_var, color):
    cat_counts = df[cat_var].value_counts()
    ax = sb.countplot(data = df, y = cat_var, order = cat_counts.index, palette = color)
    
    n_points = df.shape[0]
    locs, labels = plt.yticks()
    
    for p in ax.patches:
        percentage = '{:0.1f}%'.format(100 * p.get_width()/n_points)
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/2
        ax.annotate(percentage, (x, y), size = 20)
In [37]:
features_firm_id_top = features.firm_id.value_counts()[:20].index
features_firm_id_top_lim = features.loc[features.firm_id.isin(features_firm_id_top)]

#check there are only 50 unique values
features_firm_id_top_lim.firm_id.nunique()
Out[37]:
20
In [38]:
plt.figure(figsize = [20, 15])
x = features_firm_id_top_lim
generate_pct_bar(x, 'firm_id', 'viridis_r')
# n, bins, patches = plt.hist(x, num_bins, facecolor = 'pink', alpha = 0.5)
plt.xlabel('Firm ID', size = 20)
plt.ylabel('Count', size = 20)
plt.title('Top 20 Most Common Firm IDs under "Features"', size = 25)

plt.savefig(PATH_UNIVARIATE + 'features-firm-id-count-top.png');
plt.show();
In [39]:
features_firm_id_bottom = features.firm_id.value_counts()[-20:].index
features_firm_id_bottom_lim = features.loc[features.firm_id.isin(features_firm_id_top)]

#check there are only 50 unique values
features_firm_id_bottom_lim.firm_id.nunique()
Out[39]:
20
In [40]:
plt.figure(figsize = [20, 15])
x = features_firm_id_bottom_lim
generate_pct_bar(x, 'firm_id', 'viridis_r')
# n, bins, patches = plt.hist(x, num_bins, facecolor = 'pink', alpha = 0.5)
plt.xlabel('Firm ID', size = 20)
plt.ylabel('Count', size = 20)
plt.title('20 Least Common Firm IDs under "Features"', size = 25)

plt.savefig(PATH_UNIVARIATE + 'features-firm-id-count-bottom.png');
plt.show();

Observation 1: Both the 20 most common and least common Firm IDs all make up the same proportion of existing Firm IDs: 5.0% each. This means that under "Features", there is a consistent count among all Firm IDs at around 335.

In [41]:
#check count consistency among firm_ids
firm_counts = features.firm_id.value_counts()
np.unique(firm_counts.sort_values().values)
Out[41]:
array([332], dtype=int64)

I discovered that all firm id counts are consistent across the entire features.csv dataset at 332 entries per firm id. There are no null firm ids.

FEATURE (Features)

In [42]:
import itertools
plt.figure(figsize = [13, 8])

#set palette
base_color = sb.color_palette()[5]

#countplot
plt.subplot(1, 2, 1)
sb.countplot(data = features, x = 'feature', order = features.feature.value_counts(ascending = True).index,
            color = base_color)
frame = plt.gca()

#pie chart
plt.subplot(1, 2, 2)
sorted_counts = features['feature'].value_counts()
plt.pie(features.feature.value_counts(), startangle = 90, counterclock = False,
        autopct='%1.2f%%', labels = features.feature.value_counts().index);
plt.axis('square');

#overall graphic
plt.suptitle('All Features by Count and Percentage under "Features"', size = 20)
plt.tight_layout()
plt.subplots_adjust(top = 0.9)
plt.savefig(PATH_UNIVARIATE + 'features-feature-pct-count.png')

Observation 1: eps_act, eps_fc, and eod_act all show consistent counts at around 42500 entries each (25.30% each).

Observation 2: eps_fc_terms is the only feature type to deviate from the others, having less entries at around 41,000 (24.10%).

Observation 3: It makes sense that eps_fc_terms contains missing data, because the year 1999 was not included while gathering this data. (effectively removing 2020 entries).

DATE (Features)

In [43]:
#double check that eps_fc and eps_act are the only features to have null Date entries
features[features.date.isna()].feature.unique()
Out[43]:
array(['eps_fc', 'eps_act'], dtype=object)

Since we acknowledged that the date column is set to NaT for eps_fc and eps_act, we will write all our interpretations accordingly.

All below graphs under "Date" do not take into account the features eps_fc and eps_act.

In [44]:
features_years = features.date.dt.year.dropna().astype(int)
features_months = features.date.dt.month.dropna().astype(int)
In [45]:
#years
plt.figure(figsize = [15, 5])

ax = sb.countplot(x = features_years, color = sb.color_palette()[0])
ax.set(xlabel = 'Year', ylabel = 'Count')
ax.set_title('Frequency of Years under "Features"', size = 20)

plt.rcParams['axes.labelsize'] = 15
plt.savefig(PATH_UNIVARIATE + 'features-date-years-count.png')
plt.show()

Observation 1: All years between 2000 - 2018 have a consistent count at around 4,000 for all firms.

  • This makes sense, because the year 1999 is missing from all eps_fc_terms entries.

Observation 2: The years 1999 and 2019 are both inconsistent and less than the number of 4,000 counts for all other Date years.

Observation 3: The year 1999 has 2500 non-null entries.

Observation 4: The year 2019 has 3500 non-null entries.

In [46]:
#months
features_months = features_months.apply(lambda x: calendar.month_abbr[x])
months_order = ['Jan', 'Mar', 'Apr', 'Jun', 'Jul', 'Sep', 'Oct', 'Dec']
In [47]:
plt.figure(figsize = [10, 5])

ax = sb.countplot(data = features, x = features_months, color = sb.color_palette()[0], order = months_order)
ax.set(xlabel = 'Month', ylabel = 'Count')
ax.set_title('Frequency of Months under "Features"', size = 20)
sb.despine(offset = 10, trim = False)

plt.rcParams['axes.labelsize'] = 15
plt.savefig(PATH_UNIVARIATE + 'features-date-months-count.png')
plt.show();

Observation 1: September has the most number of counts at around 15,000.

Observation 2: October has the least number of counts at around 9,000.

Observation 3: The trend in counts of months under "Date" fluctuates. It is not a linear or exponential pattern; it seems that there is a "peak" in counts every 2nd recorded month from January.

  • For example, March peaks at around 11,000 counts, then June peaks at around 10,500 counts, and Septemberat around 15,000.

It is safe to conclude that the "Date" column is unreliable when examining terms and years, and should be avoided. It is better to use the "Date" column only when referring to specific dates.

TERM (Features)

As noted earlier, there are no missing values under the Term column. Therefore, all graphs below do account for eps_act and eps_fc.

In [48]:
#years
plt.figure(figsize = [20, 10])

ax = sb.countplot(data = features, x = features.term.dt.year, color = sb.color_palette()[0])
ax.set(xlabel = 'Year', ylabel = 'Count')
ax.set_title('Frequency of Term Years under "Features"', size = 20)

plt.rcParams['axes.labelsize'] = 15
plt.savefig(PATH_UNIVARIATE + 'features-term-years-count.png')
plt.show();

Observation 1: All years from 2000 to 2019 have consistent counts at 8,000 per year.

Observation 2: Unlike the years under Date, 1999 is the only year that does not follow the general trend. There are 6000 recorded entries containing 1999. This means that 2,000 entries do not contain the year 1999.


In [49]:
#quarter
plt.figure(figsize = [10, 5])

ax = sb.countplot(data = features, x = features.term.dt.quarter, color = sb.color_palette()[0])
ax.set(xlabel = 'Quarter', ylabel = 'Count')
ax.set_title('Frequency of Term Quarters under "Features"', size = 20)

plt.rcParams['axes.labelsize'] = 15
plt.savefig(PATH_UNIVARIATE + 'features-term-quarters-count.png')
plt.show();

Observation 1: There is a consistent number of quarters under term, unlike years.

Observation 2: This means that quarters is a more stable, reliable variable to examine under Term unlike years, which should be examined more closely in regard to which feature(s) contains that year.

VALUE (Features)

In [50]:
def generate_hist(df, x, bins, title, xlabel, ylabel, save_path, csv_name):
    plt.figure(figsize = [14, 7])
    
    plt.hist(data = df, x = x, bins = bins, color = 'palevioletred')
    plt.title(title, size = 25)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    
    plt.savefig(save_path + csv_name)
In [51]:
value_bins = np.arange(features.value.min(), features.value.max() + 100, 100)
generate_hist(features, 'value', value_bins, 'Distribution of All Values under Features', 
                'Value (eps_fc, eps_act, eod_act, OR eps_fc_terms)',
                'Count', PATH_UNIVARIATE, 'features-value-hist.png')

Observation 1: The trend in value counts is heavily right-skewed

Observation 2: The value range 0-100 contains the highest concentration of data, with over 120,000 entries.

Observation 3: The value range 0-300 contains the "bulk" of all the data, which means the surrounding x-axis values are all outliers.


Questions:

4) How do value counts under Features look like after removing all outliers around the value range 0-300?

In [52]:
value_bins = np.arange(-5, 200 + 10, 10)
value_hist = generate_hist(features, 'value', value_bins, 'Distribution of All Values under Features (without outliers)', 
                'Value (eps_fc, eps_act, eod_act, OR eps_fc_terms)',
                'Count', PATH_UNIVARIATE, 'features-value-hist-zoom-1.png')

Question 5: How do value counts under Features look like after removing all outliers around the value range 0-300?

Observation 1: The graph appears to be a normal distribution with a strong right skew. This is still unlike the graph with all the outliers, where the right skew is much more pronounced and prominent.

Observation 2: Instead of breaking the bins up by bin widths of 100, the bin width here is 10.

Observation 3: The value range 0-10 contains the bulk of all the data. This is the highest bar, with counts of around 12,000.

  • The previous graph showed that the bin range 0-100 contains around 120,000 points.
  • This graph shows that the bin range 0-10 contains around 120,000 points.
  • Drawing from the previous 2 observations, we can conclude that it's really the bin range 0-10 that contains the bulk of all the data.

It would be a good idea to break down the 0-10 bin range even further, and examine solely that range.

In [53]:
value_bins = np.arange(0, 10 + 0.05, 0.05)
value_hist = generate_hist(features, 'value', value_bins, 'Distribution of All Values under Features (range 0-10)', 
                'Value (eps_fc, eps_act, eod_act, OR eps_fc_terms)',
                'Count', PATH_UNIVARIATE, 'features-value-hist-zoom-2.png')

Observation 1: The graph, even when zoomed in, still retains a strong right skew.

Observation 2: The "bulk" of the data is around the value range 0.02 to 0.05: the "peak" of the distribution.

Just to make sure this graph is skewed heavily to the right, let's create a kernel density curve:

In [54]:
def generate_distplot(data, bins):
    fig = plt.figure(figsize = [14, 7])
    ax = sb.distplot(data, bins = bins, color = 'hotpink')
    ax.minorticks_on()
    return fig, ax
In [55]:
value_bins = np.arange(features.value.min(), features.value.max() + 20, 20)
generate_distplot(features.value.dropna(), bins = value_bins)
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Distribution of all Values under "Features"', size = 25);
plt.savefig(PATH_UNIVARIATE + 'features-value-dist.png')

Observation: As shown by the kernel density curve, the distribution of values under Features is heavily right-skewed. This means that most values are clustered around the left-side of the distribution where the mean, median, and mode are all located.

FIRM_ID (Averages)

In [56]:
avgs_firm_id_top = avgs.firm_id.value_counts()[:20].index
avgs_firm_id_top_lim = avgs.loc[avgs.firm_id.isin(avgs_firm_id_top)]

#check there are only 20 unique values
avgs_firm_id_top_lim.firm_id.nunique()
Out[56]:
20
In [57]:
plt.figure(figsize = [15, 15])
x = avgs_firm_id_top_lim
num_bins = 5
generate_pct_bar(x, 'firm_id', 'viridis_r')
# n, bins, patches = plt.hist(x, num_bins, facecolor = 'pink', alpha = 0.5)
plt.xlabel('Firm ID')
plt.ylabel('Count')
plt.title('Top 20 Most Common Firm IDs under "Averages"', size = 20)

plt.savefig(PATH_UNIVARIATE + 'avgs-firm-id-count-top.png');
plt.show();
In [58]:
avgs_firm_id_bottom = avgs.firm_id.value_counts()[-20:].index
avgs_firm_id_bottom_lim = avgs.loc[avgs.firm_id.isin(avgs_firm_id_top)]

#check there are only 20 unique values
avgs_firm_id_bottom_lim.firm_id.nunique()
Out[58]:
20
In [59]:
plt.figure(figsize = [20, 15])
x = avgs_firm_id_bottom_lim
generate_pct_bar(x, 'firm_id', 'viridis_r')
# n, bins, patches = plt.hist(x, num_bins, facecolor = 'pink', alpha = 0.5)
plt.xlabel('Firm ID', size = 20)
plt.ylabel('Count', size = 20)
plt.title('20 Least Common Firm IDs under "Averages"', size = 25)

plt.savefig(PATH_UNIVARIATE + 'avgs-firm-id-count-bottom.png');
plt.show();

Observation 1: Both the 20 most common and least common Firm IDs all make up the same proportion of existing Firm IDs: 5.0% each. This means that under "Averages", there is a consistent count among all Firm IDs at around 105.

In [60]:
#check consistency of counts among firm_ids
firm_counts = avgs.firm_id.value_counts()
np.unique(firm_counts.sort_values().values)
Out[60]:
array([103], dtype=int64)

I discovered that all firm id counts are consistent across the entire avgs.csv dataset at 105 entries per firm id. There are no null firm ids.

AVERAGE (Averages)

In [61]:
value_bins = np.arange(avgs.average.min(), avgs.average.max() + 100, 100)
generate_hist(avgs, 'average', value_bins, 'Distribution of All Averages', 
                'Average(twenty-year, quarterly, yearly)',
                'Count', PATH_UNIVARIATE, 'avgs-avg-hist.png')

Observation 1: The trend in value counts is heavily right-skewed

Observation 2: The value range 0-100 contains the highest concentration of data, with around 36,000 entries.

Observation 3: The value range 0-300 contains the "bulk" of all the data, which means the surrounding x-axis values are all outliers.

Observation 4: There is a all values, whether under Features or Averages, share a common theme of taking up the "bulk" of data in the same bin range


Questions:

6) How do value counts under Averages look like after removing all outliers around the value range 0-300?

In [62]:
value_bins = np.arange(-5, 200 + 10, 10)
generate_hist(avgs, 'average', value_bins, 'Distribution of All Averages (without Outliers)', 
                'Average(twenty-year, quarterly, yearly)',
                'Count', PATH_UNIVARIATE, 'avgs-avg-hist-zoom-1.png')

Question 6: How do average counts under Averages look like after removing all outliers around the value range 0-300?

Observation 1: The graph appears to be a normal distribution with a strong right skew. This is still unlike the previous graph with all the outliers, where the right skew is much more pronounced and prominent.

Observation 2: Instead of breaking the bins up by bin widths of 100, the bin width here is 10.

Observation 3: The value range 0-10 contains the bulk of all the data. This is the highest bar, with counts of around 12,000.

  • The previous graph showed that the bin range 0-100 contains around 36,000 points.
  • This graph shows that the bin range 0-10 contains around 34,000 points.
  • Drawing from the previous 2 observations, we can conclude that it's really the bin range 0-10 that contains the bulk of all the data, with the other ~2,000 points being scattered among the bins from 10 to 110.

Observation 4: Among the bins from 10 to 110 that contain the other 2,000 counts, the bin 30-40 contains the "bulk" of that data at around 2,000 counts.

It would be a good idea to break down the 0-10 bin range even further for Averages, and examine solely that range.

In [63]:
value_bins = np.arange(0, 10 + 0.05, 0.05)
value_hist = generate_hist(avgs, 'average', value_bins, 'Distribution of All Averages (range 0-10)', 
                'Average',
                'Count', PATH_UNIVARIATE, 'avgs-avgs-hist-zoom-2.png')

Observation 1: The graph, even when zoomed in, still retains a strong right skew.

Observation 2: The "bulk" of the data is around the value range 0.02 to 0.05: the "peak" of the distribution.

Just to make sure this graph is skewed heavily to the right, let's create a kernel density curve:

In [64]:
value_bins = np.arange(avgs.average.min(), avgs.average.max() + 20, 20)
generate_distplot(avgs.average.dropna(), bins = value_bins)
plt.xlabel('Average')
plt.ylabel('Density')
plt.title('Distribution of all Averages under "Averages"', size = 25);
plt.savefig(PATH_UNIVARIATE + 'avgs-average-dist.png')

Observation: As shown by the kernel density curve, the distribution of average under Averages is heavily right-skewed. This means that most averages are clustered around the left-side of the distribution where the mean, median, and mode are all located.

A KEY TAKEAWAY: There is a strong congruency between the distribution patterns of stock prices (values) under features.csv and average prices (averages) under Averages.

AVERAGE TYPE (Averages)

In [65]:
plt.figure(figsize = [10, 5])
cat_order = avgs.average_type.value_counts().index
sb.countplot(data = avgs, x = 'average_type', color = 'hotpink', order = cat_order)
plt.xlabel('Average Type')
plt.ylabel('Count')
plt.title('Average Type by Count', size = 20)
plt.savefig(PATH_UNIVARIATE + 'avgs-avgtype-count.png')

Observation 1: Under Averages, yearly data has the highest number of counts while twenty-year data has the least number of counts. This all makes sense, as there are more yearly averages than quarters per firm, and more quarters than twenty-year entries per firm.

  • For each firm, there are around 20 yearly averages (1999 to 2019) for each of the 4 features. (80 total per firm).
  • For each firm, there are 4 quarterly averages for each of the 4 features. (16 total per firm)
  • For each firm, there is 1 twenty-year average for each of the 4 features. (4 total per firm)

Note: number of entries may vary for yearly averages because the year 1999 and 2019 are missing for some features.

In [66]:
#keep for analysis, delete when done w/ report
len(avgs.query('average_type == "yearly" and firm_id == 0'))
Out[66]:
83
In [67]:
colors = random.choices(list(mcolors.CSS4_COLORS.values()), k = 3)

plt.figure(figsize = [10, 5])
cs=cm.Set1(np.arange(40)/40.)

plt.pie(avgs.average_type.value_counts(), startangle = 90, counterclock = False,
        autopct='%1.2f%%', labels = avgs.average_type.value_counts().index, colors = colors);
plt.suptitle('Average Type by Percentage', size = 20)
plt.tight_layout()
plt.subplots_adjust(top = 0.9)
plt.axis('square');
plt.savefig(PATH_UNIVARIATE + 'avgs-avgtype-pie.png')

Observation 1: As expected, yearly average types make up the majority of all entries at 80.58%.

Observation 2: As expected, twenty-year average types make up the least portion of all entries at 3.88%.

Observation 3: These percentages are consistent with the previous counts.

TIME_PERIOD (Averages)

In [68]:
plt.figure(figsize = [15, 15])
cat_order = avgs.time_period.value_counts().sort_index().index
sb.countplot(data = avgs, x = 'time_period', color = 'hotpink', order = cat_order)
plt.xlabel('Time Period (yearly or quarterly)')
plt.ylabel('Count')
plt.title('Yearly & Quarterly Time Periods by Count (Averages)', size = 20)
plt.savefig(PATH_UNIVARIATE + 'avgs-time-period-hist.png')

I decided to portray both yearly and quarterly Time Periods in the same countplot because q1-q4 and the years 2000-2019 show consistent counts.

Observation 1: All years from 2000-2019 and all quarters show consistent counts, at just over 2,000 entries each.

Observation 2: The year 1999 has the least amount of counts, at only around 1510. This is because the feature eps_fc_terms does not contain the year 1999.

Observation 3: The time period for twenty-year averages does not show because by default, all Time Period entries associated with average type of twenty-year is set to null.

FEATURE (Averages)

In [69]:
plt.figure(figsize = [14, 7])

#set palette
base_color = sb.color_palette()[5]

#countplot
plt.subplot(1, 2, 1)
cat_order = avgs.feature.value_counts().sort_index(ascending = False).index
# generate_pct_bar(avgs, 'feature', 'Blues_d')
sb.countplot(data = avgs, x = avgs.feature, color = 'hotpink', order = cat_order)

#pie chart
plt.subplot(1, 2, 2)
sorted_counts = avgs['feature'].value_counts()
plt.pie(avgs.feature.value_counts(), startangle = 90, counterclock = False,
        autopct='%1.2f%%', labels = avgs.feature.value_counts().index);
cat_order = avgs.feature.value_counts().sort_index(ascending = False).index
plt.axis('square');

#overall graphic
plt.suptitle('All Features by Count and Percentage under "Averages"', size = 20)
# plt.tight_layout()
plt.subplots_adjust(top = 0.9)
plt.savefig(PATH_UNIVARIATE + 'avgs-feature-hist-pie.png')

Observation 1: eps_act, eps_fc, and eod_act all show consistent counts at around 13,000 entries each (25.24% each).

Observation 2: eps_fc_terms is the only feature type to deviate from the others, having less entries at around 12,500 (24.27%).

Observation 3: It makes sense that eps_fc_terms contains missing data, because the year 1999 was not included while gathering this data. This effectively removes around 500 entries: not a substantial amount of data.

IV) Bivariate Exploration

General rule: The closer the eps_act - eps_fc difference is to 0, the more accurate the forecast.

VALUE by FEATURE

In [70]:
temp = features[features.feature.isin(['eps_fc', 'eps_act'])]
In [71]:
plt.figure(figsize = [10, 10])
sb.stripplot(x = 'feature', y = 'value', data = features, jitter = True)

plt.title('Stock Price Types and ', size = 20)
plt.xlabel('Stock Price Type')
plt.ylabel('Value')

plt.xticks([0, 1, 2, 3], ['EOD', 'Forecasted EPS (3 Months Prior)', 'Forecasted EPS', 'Actual EPS'])
plt.savefig(PATH_BIVARIATE + 'features-feature-value.png')
plt.show();

Observation 1: EOD Price has the most variability among the other stock price types.

Observation 2: Forecasted EPS (3 months prior) and Forecasted EPS (current) are the only stock price types to appear uniforn in distribution: their values are clustered around 0.

Observation 3: Actual EPS has the most outliers compared to the other stock price types: 5 outliers. And since actual EPS has the most outliers, it turns out that forecasters' predictions have been too uniform and stable in the approach of analyzing stock market trends.


EOD Price by Term (features.csv)

In [72]:
features_eod = features[features.feature == 'eod_act']

#create 5-year moving averages to "smooth" out all future graphs
features_eod['ma_value'] = features_eod.value.rolling(12).mean()
D:\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
In [73]:
plt.figure(figsize = [20, 10])
sb.pointplot(x = 'term', y = 'value', data = features_eod, errwidth = 0.5)
plt.xticks(rotation = 'vertical')

plt.title('Average EOD Price by Term Across All Firms', size = 20)
plt.xlabel('Term')
plt.ylabel('Average EOD Price')

plt.savefig(PATH_BIVARIATE + 'features-eod-term.png')

Observation 1: The graph depicts a positive linear trend from all quarters spanning 1999 - 2019.

Observation 2: The average EOD Price troughs during 2009 Quarter 1. The start of the trend in this downfall starts in 2007 Quarter 4, but the average EOD price recovers shortly after, only 1 quarter later on 2009, Quarter 2.

In layman's terms, this generally positive linear trend shows that all 505 firms in the S&P 2019 Index have been getting "richer" over the past 20 years.

In [74]:
plt.figure(figsize = [20, 10])
sb.pointplot(x = 'term', y = 'ma_value', data = features_eod, errwidth = 0.5)
plt.xticks(rotation = 'vertical')

plt.suptitle('Average EOD Price by Term Across All Firms', size = 20, y = .93)
plt.title('in 3-year Moving Averages from 1999 - 2019', size = 15)
plt.xlabel('Term')
plt.ylabel('Average EOD Price')

plt.savefig(PATH_BIVARIATE + 'features-eod-term-ma.png')

The EOD graph could be used to make correlations with EPS forecasts and how the firm is getting "richer" overtime.

DIFFERENCE IN FORECASTED AND ACTUAL EPS (PREDICTION ERROR) BY TERM

In [75]:
#isolate eps_fc and eps_act in their own columns to get their difference
features_diffs = features[features.feature.isin(['eps_fc', 'eps_act'])][['firm_id', 'term', 'value', 'feature']]
In [76]:
#isolate eps_fc and eps_act
eps_fc_diffs = features_diffs[features_diffs.feature == 'eps_fc'].rename(columns = {'value' : 'eps_fc_value'})[['firm_id', 'term', 'eps_fc_value']]
eps_act_diffs = features_diffs[features_diffs.feature == 'eps_act'].rename(columns = {'value' : 'eps_act_value'})[['firm_id', 'term', 'eps_act_value']]
In [77]:
#merge, creating separate columns for eps_fc and eps_act values
act_fc_diffs = eps_fc_diffs.merge(eps_act_diffs, how = 'inner', left_on = ['firm_id', 'term'], right_on = ['firm_id', 'term'])

act_fc_diffs.sample(5)
Out[77]:
firm_id term eps_fc_value eps_act_value
28508 339 2007Q1 0.229 0.150
15512 184 2013Q1 1.400 1.460
37096 441 2012Q1 0.328 0.120
35955 428 1999Q4 NaN NaN
11698 139 2004Q3 0.295 0.325
In [78]:
#create 'difference' column
act_fc_diffs['difference'] = act_fc_diffs['eps_act_value'] - act_fc_diffs['eps_fc_value']
In [79]:
plt.figure(figsize = [20, 15])
sb.pointplot(data = act_fc_diffs, x = 'term', y = 'difference', color = 'purple', errwidth = .5)

plt.suptitle('Average Prediction Error of EPS Values by Term', size = 20, y = .93)
plt.title('across All Firms', size = 19)
plt.xticks(rotation = 'vertical')
plt.xlabel('Term (Year and Quarter)')
plt.ylabel('Average Prediction Error')

plt.savefig(PATH_BIVARIATE + 'features-act-fc-diffs-term.png')
plt.show();

Observation 1: Forecasters were most optimistic in 2008, Quarter 4.

Observation 2: Forecasters were most pessimistic in 2000, Quarter 4.

Observation 3: The trend indicates no linear relationship; it follows in a stable, one-dimensional direction with average prediction errors center around 0.

In [80]:
plt.figure(figsize = [20, 15])
sb.pointplot(data = act_fc_diffs, x = act_fc_diffs.term.dt.year, y = 'difference', color = 'purple', errwidth = .5)

plt.suptitle('Average Prediction Error of EPS Values by Year', size = 20, y = .93)
plt.title('across All Firms', size = 19)
# plt.xticks(rotation = 'vertical')
plt.xlabel('Year')
plt.ylabel('Average Prediction Error')

plt.savefig(PATH_BIVARIATE + 'features-act-fc-diffs-year.png')
plt.show();

Observation 1: Forecasters were most pessimistic in the year 2000. From the error bar, the year 2000 shows one of the widest variances in average prediction errors ranging from 0 until 0.5

Observation 2: Forecasters were most optimistic in the year 2008. From the error bar, the year 2008 shows one of the widest variances in average prediction errors ranging from -0.25 until -1.5.

Observation 3: The overall trend stays consistent with the previous trends by term: consistent, relatively stable, and containing no slope.

In [81]:
plt.figure(figsize = [20, 15])
sb.pointplot(data = act_fc_diffs, x = act_fc_diffs.term.dt.quarter, y = 'difference', color = 'purple', errwidth = .5)

plt.suptitle('Average Prediction Error of EPS Values by Quarter', size = 20, y = .93)
plt.title('across All Firms', size = 19)
# plt.xticks(rotation = 'vertical')
plt.xlabel('Quarter')
plt.ylabel('Average Prediction Error')

plt.savefig(PATH_BIVARIATE + 'features-act-fc-diffs-quarter.png')
plt.show();

Observation: The later the quarter, the more optimistic forecasters become in their predictions. I am assuming this is because each new quarter brings the familiarity of the current year's stock market dynamics.

DIFFERENCE IN MOVING AVERAGE FORECASTED AND ACTUAL EPS (PREDICTION ERROR) BY TERM

In [82]:
#3-year moving averages
act_fc_diffs['ma_difference'] = act_fc_diffs.difference.rolling(12).mean()
In [83]:
plt.figure(figsize = [20, 15])
sb.pointplot(data = act_fc_diffs, x = 'term', y = 'ma_difference', color = 'purple', errwidth = .5)

plt.suptitle('Average Prediction Error of EPS Values by Term', size = 20, y = .93)
plt.title('in 3-Year Moving Averages (Actual - Forecasted EPS)', size = 19)
plt.xticks(rotation = 'vertical')
plt.xlabel('Term (Year and Quarter)')
plt.ylabel('Average Prediction Error')

plt.savefig(PATH_BIVARIATE + 'features-act-fc-diffs-term-ma.png')
plt.show();
In [84]:
plt.figure(figsize = [20, 15])
sb.pointplot(data = act_fc_diffs, x = act_fc_diffs.term.dt.year, y = 'ma_difference', color = 'purple', errwidth = .5)

plt.suptitle('Average Prediction Error of EPS Values by Year', size = 20, y = .93)
plt.title('in 3-Year Moving Averages (Actual - Forecasted EPS)', size = 19)
plt.xlabel('Year')
plt.ylabel('Average Prediction Error')

plt.savefig(PATH_BIVARIATE + 'features-act-fc-diffs-year-ma.png')
plt.show();
In [85]:
plt.figure(figsize = [20, 15])
sb.pointplot(data = act_fc_diffs, x = act_fc_diffs.term.dt.quarter, y = 'ma_difference', color = 'purple', errwidth = .5)

plt.suptitle('Average Prediction Error of EPS Values by Quarter', size = 20, y = .93)
plt.title('in 3-Year Moving Averages (Actual - Forecasted EPS)', size = 19)
plt.xlabel('Quarter')
plt.ylabel('Average Prediction Error')

plt.savefig(PATH_BIVARIATE + 'features-act-fc-diffs-quarter-ma.png')
plt.show();

Observation 1: Compared to their non-moving-average counterparts, the above pointplots plotting 3-year moving averages show a consistent slopeless trned, including periods of optimism vs. pessimism.

DIFFERENCE IN TWENTY YEAR AVERAGE OF EPS_FC AND EPS_ACT BY FIRM (Averages)

In [86]:
#array of all firm ids
ids = firm_ids.firm_id.values
In [87]:
#put eps_fc and eps_act averages into separate DFs
avgs_eps_fc = avgs[avgs['feature'] == 'eps_fc']
avgs_eps_act = avgs[avgs['feature'] == 'eps_act']
In [88]:
#isolate eps_fc and eps_act DFs by "twenty_year" average type
avgs_eps_fc_twenty = avgs_eps_fc[avgs_eps_fc['average_type'] == 'twenty_year']
avgs_eps_act_twenty = avgs_eps_act[avgs_eps_act['average_type'] == 'twenty_year']
In [89]:
#df of only twenty-year average types
avgs_twenty = pd.concat([avgs_eps_fc_twenty, avgs_eps_act_twenty])
In [90]:
#obtain differences in eps_fc and eps_act averages BY firm id
avg_eps_fc_twenty = avgs_twenty.query('feature == "eps_fc"').average.values
avg_eps_act_twenty = avgs_twenty.query('feature == "eps_act"').average.values
twenty_diff_data = list(zip(ids, avg_eps_fc_twenty, avg_eps_act_twenty))

twenty_diffs = pd.DataFrame(twenty_diff_data, columns = ['firm_id', 'average_eps_fc', 'average_eps_act'])
twenty_diffs['difference'] = twenty_diffs['average_eps_act'] - twenty_diffs['average_eps_fc']
twenty_diffs['average_type'] = 'twenty_year'
In [91]:
twenty_diffs.sample(5)
Out[91]:
firm_id average_eps_fc average_eps_act difference average_type
129 129 1.598345 1.647725 0.049380 twenty_year
243 243 0.577012 0.552115 -0.024897 twenty_year
418 418 0.574786 0.560972 -0.013813 twenty_year
390 390 0.583726 0.425149 -0.158577 twenty_year
123 123 0.619869 0.698101 0.078232 twenty_year
In [92]:
plt.figure(figsize = [20, 20])
plt.title('Prediction Error among Twenty-Year Averages by Firm ID\n(Actual EPS - Forecasted EPS)', size = 25)
plt.plot(twenty_diffs['firm_id'], twenty_diffs['difference'])

plt.xlabel('Firm ID', size = 15)
plt.ylabel('Prediction Error', size = 15)

plt.savefig(PATH_BIVARIATE + 'avgs-diff-twenty-year.png')
plt.show();

DIFFERENCE IN QUARTERLY AVERAGE OF EPS_FC AND EPS_ACT (Averages)

In [93]:
#isolate eps_fc and eps_act DFs by "quarterly" average type
avgs_eps_fc_quarter = avgs_eps_fc[avgs_eps_fc['average_type'] == 'quarterly']
avgs_eps_act_quarter = avgs_eps_act[avgs_eps_act['average_type'] == 'quarterly']
In [94]:
#df of only quarterly averages
avgs_quarter = pd.concat([avgs_eps_fc_quarter, avgs_eps_act_quarter])
In [95]:
#get the average average separately per firm
avgs_quarter_sep = avgs_quarter.groupby(['firm_id', 'feature']).mean().reset_index()
avgs_quarter_sep.shape[0]
Out[95]:
1010
In [96]:
#obtain differences in eps_fc and eps_act averages BY firm id
avg_eps_fc_quarter = avgs_quarter_sep.query('feature == "eps_fc"').average.values
avg_eps_act_quarter = avgs_quarter_sep.query('feature == "eps_act"').average.values

quarter_diff_data = list(zip(ids, avg_eps_fc_quarter, avg_eps_act_quarter))

quarter_diffs = pd.DataFrame(quarter_diff_data, columns = ['firm_id', 'average_eps_fc', 'average_eps_act'])
quarter_diffs['difference'] = quarter_diffs['average_eps_act'] - quarter_diffs['average_eps_fc']
quarter_diffs['average_type'] = 'quarter'
In [97]:
quarter_diffs.head()
Out[97]:
firm_id average_eps_fc average_eps_act difference average_type
0 0 0.530992 0.350250 -0.180742 quarter
1 1 0.338961 -0.394274 -0.733235 quarter
2 2 1.054504 0.986728 -0.067777 quarter
3 3 0.873143 0.944501 0.071358 quarter
4 4 1.319607 0.762376 -0.557231 quarter
In [98]:
plt.figure(figsize = [20, 15])
plt.xlabel('Firm ID', size = 15)
plt.ylabel('Prediction Error', size = 15)
plt.title('Prediction Error among Quarterly Averages by Firm ID\n(Actual EPS - Forecasted EPS)', size = 25)
plt.plot(quarter_diffs['firm_id'], quarter_diffs['difference'])
# plt.scatter(twenty_diffs['firm_id'], twenty_diffs['difference'])
plt.savefig(PATH_BIVARIATE + 'avgs-diff-quarter.png')
plt.show();

DIFFERENCE IN YEARLY AVERAGE OF EPS_FC AND EPS_ACT (Averages)

In [99]:
#isolate eps_fc and eps_act DFs by "yearly" average type
avgs_eps_fc_year = avgs_eps_fc[avgs_eps_fc['average_type'] == 'yearly']
avgs_eps_act_year = avgs_eps_act[avgs_eps_act['average_type'] == 'yearly']
In [100]:
#df of only yearly averages
avgs_year = pd.concat([avgs_eps_fc_year, avgs_eps_act_year])
In [101]:
#get the average average separately per firm
avgs_year_sep = avgs_year.groupby(['firm_id', 'feature']).mean().reset_index()
avgs_year_sep.sample(5)
Out[101]:
firm_id feature average
903 451 eps_fc 0.496250
703 351 eps_fc 0.279560
339 169 eps_fc 0.180750
407 203 eps_fc 0.750571
216 108 eps_act 0.160038
In [102]:
#obtain differences in eps_fc and eps_act averages BY firm id
avg_eps_fc_year = avgs_year_sep.query('feature == "eps_fc"').average.values
avg_eps_act_year = avgs_year_sep.query('feature == "eps_act"').average.values

year_diff_data = list(zip(ids, avg_eps_fc_year, avg_eps_act_year))

year_diffs = pd.DataFrame(year_diff_data, columns = ['firm_id', 'average_eps_fc', 'average_eps_act'])
year_diffs['difference'] = year_diffs['average_eps_act'] - year_diffs['average_eps_fc']
year_diffs['average_type'] = 'year'
In [103]:
plt.figure(figsize = [20, 15])
plt.xlabel('Firm ID', size = 15)
plt.ylabel('Prediction Error', size = 15)
plt.title('Prediction Error among Yearly Averages by Firm ID\n(Actual EPS - Forecasted EPS)', size = 25)
# plt.title('(Average Forecasted EPS - Average Actual EPS)')
# plt.rcParams = plt.rcParamsDefault
plt.plot(year_diffs['firm_id'], year_diffs['difference'])
# plt.scatter(twenty_diffs['firm_id'], twenty_diffs['difference'])
plt.savefig(PATH_BIVARIATE + 'avgs-diff-year.png')
plt.show();
  • Put all the above graphs into a single FacetGrid for easy accessibility and comparing, in the Multivariate Exploration stage.
In [104]:
#put all yearly, quarterly, and twenty-year differences into one DF
all_diffs = pd.concat([twenty_diffs, quarter_diffs, year_diffs], sort = False)
all_diffs.sample(5)
Out[104]:
firm_id average_eps_fc average_eps_act difference average_type
120 120 0.120933 0.026486 -0.094447 quarter
253 253 0.980000 0.612702 -0.367298 twenty_year
118 118 0.726226 0.662107 -0.064119 quarter
123 123 0.619869 0.698101 0.078232 year
455 455 0.000550 0.171276 0.170726 year
  • Now that we looked at the raw data, let's create a line plot depicting the average difference among all average types.
In [105]:
plt.figure(figsize = [20, 15])
plt.xlabel('Firm ID', size = 15)
plt.ylabel('Prediction Error', size = 15)
plt.title('Differences in Yearly Averages by Firm ID\n(Actual EPS - Forecasted EPS)', size = 25)
# plt.title('(Average Forecasted EPS - Average Actual EPS)')
# plt.rcParams = plt.rcParamsDefault
plt.plot(year_diffs['firm_id'], year_diffs['difference'])
# plt.scatter(twenty_diffs['firm_id'], twenty_diffs['difference'])
plt.savefig(PATH_BIVARIATE + 'avgs-diff-year.png')
plt.show();
In [106]:
plt.figure(figsize = [20, 15])
plt.xlabel('Firm ID', size = 15)
plt.ylabel('Prediction Error', size = 15)
plt.title('Prediction Error among All Averages by Firm ID\n(Twenty-Year, Yearly, and Quarterly)', size = 25)
plt.plot(all_diffs['firm_id'], all_diffs['difference'])
# plt.scatter(twenty_diffs['firm_id'], twenty_diffs['difference'])
plt.savefig(PATH_BIVARIATE + 'avgs-diff-all.png')
plt.show();

Observation: Since all graphs depicting the average differences of forecasted and actual EPS, it is safe to use the all_diffs DF to answer for the yearly, quarterly, and twenty-year interpretations.

Approach: Isolate the top 20 firms with the largest absolute differences (most inaccurate) and the top 20 firms with the smallest absolute differences (most accurate)

In [107]:
#helper function to convert a DF for firm_ids to their tickers
def convert_ids_to_tickers(series):
    series = series.apply(lambda x: firm_ids[firm_ids.firm_id == x].values[0][1]).values
    return [x.upper() for x in series]
In [108]:
#add column for absolute difference
all_diffs['difference_abs'] = all_diffs['difference'].abs()
In [109]:
#create column referring to firm ticker
all_diffs['firm_ticker'] = convert_ids_to_tickers(all_diffs.firm_id)
In [110]:
#rename column and values to fit legend
all_diffs.average_type = all_diffs.average_type.str.replace('_', ' ').str.capitalize()
all_diffs = all_diffs.rename(columns = {'average_type' : 'Average Type'})

Twenty-Year

In [111]:
#isolate twenty-year average type
twenty_diffs = all_diffs[all_diffs['Average Type'] == 'Twenty year']
In [112]:
#reorder firms by absolute difference
twenty_diffs_top = twenty_diffs.sort_values(by='difference_abs', ascending = False)
twenty_diffs_bottom = twenty_diffs.sort_values(by='difference_abs', ascending = True)
In [113]:
#drop rows with duplicate firm ids, get top 20 firm ids
twenty_diffs_top = twenty_diffs_top.drop_duplicates(subset = 'firm_id', keep = 'first').head(20)
twenty_diffs_bottom = twenty_diffs_bottom.drop_duplicates(subset = 'firm_id', keep = 'first').head(20)
In [114]:
#plot firm IDs as CATEGORICAL variables by their respective twenty-year differences
plt.figure(figsize = [20, 15])

sb.pointplot(x = twenty_diffs_top.firm_ticker, y = twenty_diffs_top.difference)

plt.suptitle('Top 20 Most Inaccurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('as Twenty-Year Averages from 1999 - 2019', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Prediction Error', size = 17)

plt.savefig(PATH_BIVARIATE + 'avgs-diff-twenty-year-top.png')

The 9 most inaccurately predicted firm tickers are AIG, LRCX, GM, BKNG, CHTR, AGN, GOOGL, and NVR.

In [115]:
#plot firm IDs as CATEGORICAL variables by their respective twenty-year differences
plt.figure(figsize = [20, 15])

sb.pointplot(x = twenty_diffs_bottom.firm_ticker, y = twenty_diffs_bottom.difference)

plt.suptitle('Top 20 Most Accurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('as Twenty-Year Averages from 1999 - 2019', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Prediction Error', size = 17)

plt.savefig(PATH_BIVARIATE + 'avgs-diff-twenty-year-bottom.png')

At first, the trend appears unstable, and this is a tricky and wrong assumption. Visualizing that absolute value of each point proves that this trend among the top 20 most accurately predicted firms is a steady, linear pattern.

Quarterly

In [116]:
#isolate twenty-year average type
quarter_diffs = all_diffs[all_diffs['Average Type'] == 'Quarter']
In [117]:
#create temp DF to get absolute mean difference of all firms
quarter_diffs_means = quarter_diffs.groupby('firm_ticker').mean().reset_index()
In [118]:
#get firm tickers of top/bottom 20 firms
quarter_diffs_top_ids = quarter_diffs_means.sort_values(by = 'difference_abs', ascending = False).head(20).firm_ticker.values
quarter_diffs_bottom_ids = quarter_diffs_means.sort_values(by = 'difference_abs', ascending = True).head(20).firm_ticker.values
In [119]:
#filter quarterly DF for top/bottom firm tickers
quarter_diffs_top = quarter_diffs[quarter_diffs.firm_ticker.isin(quarter_diffs_top_ids)]
quarter_diffs_bottom = quarter_diffs[quarter_diffs.firm_ticker.isin(quarter_diffs_bottom_ids)]
In [120]:
#plot firm IDs as CATEGORICAL variables by their respective twenty-year differences
plt.figure(figsize = [20, 15])

sb.pointplot(x = quarter_diffs_top.firm_ticker, y = quarter_diffs_top.difference, order = quarter_diffs_top_ids)

plt.suptitle('Top 20 Most Inaccurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('as Quarterly Averages from 1999 - 2019', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Average Prediction Error', size = 17)

plt.savefig(PATH_BIVARIATE + 'avgs-diff-quarter-top.png')

The 9 most inaccurately predicted firms by quarterly average prediction error are AIG, LRCX, GM, BKNG, CHTR, AGN, GOOGL, NVR, and AZO.

In [121]:
#plot firm IDs as CATEGORICAL variables by their respective quarterly differences
plt.figure(figsize = [20, 15])

sb.pointplot(x = quarter_diffs_bottom.firm_ticker, y = quarter_diffs_bottom.difference, order = quarter_diffs_bottom_ids)

plt.suptitle('Top 20 Most Accurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('as Quarterly Averages from 1999 - 2019', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Average Prediction Error', size = 17)

plt.savefig(PATH_BIVARIATE + 'avgs-diff-quarter-bottom.png')

Similar to the previous twenty-year prediction error point plot, this graph shows a general oscillation between negative and positive values. However, this does not matter in the long run, because the absolute distance between 0 and all the above y-values decreases.

Yearly

In [122]:
#isolate yearly average types
year_diffs = all_diffs[all_diffs['Average Type'] == 'Year']
In [123]:
#create temp DF to get absolute mean difference for all firms
year_diffs_means = year_diffs.groupby('firm_ticker').mean().reset_index()
In [124]:
#get firm tickers of top/bottom 20 firms
year_diffs_top_ids = year_diffs_means.sort_values(by = 'difference_abs', ascending = False).head(20).firm_ticker.values
year_diffs_bottom_ids = year_diffs_means.sort_values(by = 'difference_abs', ascending = True).head(20).firm_ticker.values
In [125]:
#filter yearly DF for top/bottom firm tickers
year_diffs_top = year_diffs[year_diffs.firm_ticker.isin(year_diffs_top_ids)]
year_diffs_bottom = year_diffs[year_diffs.firm_ticker.isin(year_diffs_bottom_ids)]
In [126]:
#plot firm IDs as CATEGORICAL variables by their respective twenty-year differences
plt.figure(figsize = [20, 15])

sb.pointplot(x = year_diffs_top.firm_ticker, y = year_diffs_top.difference, order = year_diffs_top_ids)

plt.suptitle('Top 20 Most Inaccurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('as Yearly Averages from 1999 - 2019', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Average Prediction Error', size = 17)

plt.savefig(PATH_BIVARIATE + 'avgs-diff-year-top.png')

The top 7 most inaccurately predicted firms by yearly average prediction errors are AIG, LRCX, GM, BKNG, CHTR, and NVR.

In [127]:
#plot firm IDs as CATEGORICAL variables by their respective twenty-year differences
plt.figure(figsize = [20, 15])

sb.pointplot(x = year_diffs_bottom.firm_ticker, y = year_diffs_bottom.difference, order = year_diffs_bottom_ids)

plt.suptitle('Top 20 Most Accurately Predicted Firms', size = 20, y = .93)
plt.title('as Yearly Averages from 1999 - 2019', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Average Prediction Error', size = 17)

plt.savefig(PATH_BIVARIATE + 'avgs-diff-year-bottom.png')

Consistent with previous trends, this graph of the 10 most accurate firms by yearly average predicted errors is varied in its collection of positive and negative values inching closer towards zero.

Twenty-Year, Quarterly, and Yearly

Main question: How do firms differ by forecast inaccuracy by average type (yearly, quarterly, and yearly)?

Steps:

1) Calculate the average yearly, quarterly, and twenty-difference for each firm.

2) Convert the new average differences to their absolute value.

3) Isolate the top 20 and bottom 20 firms by their average absolute distances.

4) Convert Firm IDs to Firm Tickers

5) Use those firm tickers to create a visualization using 2 different datasets:

  • the avgs DF (all raw data)
  • the all_diffs DF (filtered data)
In [128]:
#calculate average difference for all average types
all_diffs_means = all_diffs.groupby('firm_ticker').mean().reset_index()
In [129]:
#get firm tickers of top/bottom 20 firms
all_diffs_top_ids = all_diffs_means.sort_values(by = 'difference_abs', ascending = False).head(20).firm_ticker.values
all_diffs_bottom_ids = all_diffs_means.sort_values(by = 'difference_abs', ascending = True).head(20).firm_ticker.values
In [130]:
#filter DF for top/bottom firm tickers
all_diffs_top = all_diffs[all_diffs.firm_ticker.isin(all_diffs_top_ids)]
all_diffs_bottom = all_diffs[all_diffs.firm_ticker.isin(all_diffs_bottom_ids)]
In [131]:
#plot firm IDs as CATEGORICAL variables by their respective twenty-year differences
plt.figure(figsize = [20, 15])

sb.pointplot(x = all_diffs_top.firm_ticker, y = all_diffs_top.difference, order = all_diffs_top_ids, errwidth = 0.5)

plt.suptitle('Top 20 Most Inaccurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('across All Average Types', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Average Prediction Error', size = 17)

plt.savefig(PATH_BIVARIATE + 'avgs-diff-all-top.png')

The top 8 most inaccurate firms by all average types are AIG, LRCX, GM, BKNG, CHTR, AGN, NVR, and GOOGL.

In [132]:
#plot firm IDs as CATEGORICAL variables by their respective twenty-year differences
plt.figure(figsize = [20, 15])

sb.pointplot(x = all_diffs_bottom.firm_ticker, y = all_diffs_bottom.difference, order = all_diffs_bottom_ids, errwidth = 0.5)

plt.suptitle('Top 20 Most Accurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('across All Average Types', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Average Prediction Error', size = 17)

plt.savefig(PATH_BIVARIATE + 'avgs-diff-all-bottom.png')

The same trend of all values' absolute values steadily approaching 0. Nothing new.

Now, let's look at the top 20 most inaccurate firms, where I isolated the first couple of firms before the general trend started normalizing, approaching to 0.

  • twenty-year average prediction errors: AIG, LRCX, GM, BKNG, CHTR, AGN, GOOGL, and NVR.

  • quarterly average prediction errors: AIG, LRCX, GM, BKNG, CHTR, AGN, GOOGL, NVR, and AZO.

  • yearly average prediction errors: AIG, LRCX, GM, BKNG, CHTR, and NVR.

  • all average prediction errors: AIG, LRCX, GM, BKNG, CHTR, AGN, NVR, and GOOGL.

From the previous summaries, some interesting observations pop up:

Observation 1: AIG is consistently the most inaccurately-predicted firm across all twenty-year, quarterly, and/or yearly average types.

Observation 2: The firms in common—AIG, LRCX, GM, BKNG, CHTR, and NVR—are all ranked in the same order on the x-axis. Across all average types, these firms are consistent in their order.

Observation 3: AGN and GOOGL appears in all the above average types except yearly. This means GOOGL EPS more accurately forecasted on a solely by-year basis.

Observation 4: AZO appears only in the top 20 most inaccurate quarterly average prediction error DataFrame.

Linear Regression Test for all EPS_ACT AVERAGES vs. EPS_FC AVERAGES

Considering that it's impossible to visually conclude a skew or distribution from the graphics alone, can we implement linear regression tests to verify that the relationship between forecasted and actual EPS is statistically significant?

In [133]:
def implement_ols(df, y, x, avg_type):
    lm = sm.OLS(df[y], df[['intercept', x]], missing = 'drop')
    result = lm.fit()
    
    p_val = result.pvalues[x]
    coeff = result.params[x]
    r = math.sqrt(result.rsquared)
    
    return (result.summary(), avg_type, p_val, coeff, r)

Isolate Twenty-Year EPS data by firm to fit OLS model

In [134]:
#helper function to separate eps_act and eps_fc averages into their own columns
def separate_eps_fc_act(df, groupby_arr, value_col):
    #combine eps_fc and eps_act averages into a single column, joined on firm id and time period
    df1 = df.groupby(groupby_arr)[value_col].apply(lambda x: ', '.join(x.astype(str))).reset_index()
    
    #separate eps_fc and eps_act averages then rename columns
    df1 = pd.concat([df1[groupby_arr], df1[value_col].str.split(', ', expand = True)], axis = 1)
    df1.rename(columns = {0: 'eps_fc', 1: 'eps_act'}, inplace = True)
    
    #convert data types
    df1 = df1.astype({'eps_fc' : 'float64', 'eps_act' : 'float64'})
    
    #add intercept
    df1['intercept'] = 1
    
    return df1
In [135]:
avgs_twenty_ols = separate_eps_fc_act(avgs_twenty, ['firm_id', 'average_type'], 'average')
avgs_twenty_ols.sample(10)
Out[135]:
firm_id average_type eps_fc eps_act intercept
337 337 twenty_year 0.551443 0.446789 1
444 444 twenty_year 0.720560 0.758864 1
432 432 twenty_year 0.598333 0.557525 1
241 241 twenty_year 0.446825 0.418724 1
304 304 twenty_year 0.504494 0.442187 1
494 494 twenty_year 1.006929 0.273489 1
229 229 twenty_year 0.269500 -0.125820 1
499 499 twenty_year 0.945000 0.596803 1
438 438 twenty_year 0.579723 0.579286 1
43 43 twenty_year 0.957988 0.782911 1
In [136]:
ax = sb.lmplot(data= avgs_twenty_ols, x = 'eps_fc', y = 'eps_act', height = 10, 
          scatter_kws = {'color' : 'black'}, line_kws = {'color' : 'orchid'})

plt.title('Actual vs. Forecasted EPS', size = 25)
plt.suptitle('as Twenty-Year Averages', size = 17)
plt.xlabel('Forecasted EPS', size = 15)
plt.ylabel('Actual EPS', size = 15)

plt.savefig(PATH_BIVARIATE + 'avgs-act-fc-twenty.png')

Observation 1: The relationship between actual and forecasted EPS shows a wide variance in y-values from x = 0.0 to x = 9.0.

Observation 2: Most of the points are clustered between 0.0 and 2.5 on the x-axis. Besides that cluster, all other points are outliers, which contribute not only to the scatterplot's visible variance, but also lead the trend towards a general positive linear trend (after ignoring the outlier at x = 5.0, y = 0.25, of course)

In [137]:
twenty_ols = implement_ols(avgs_twenty_ols, 'eps_act', 'eps_fc', 'Twenty-Year')
twenty_ols[0]
Out[137]:
OLS Regression Results
Dep. Variable: eps_act R-squared: 0.651
Model: OLS Adj. R-squared: 0.651
Method: Least Squares F-statistic: 927.1
Date: Mon, 04 May 2020 Prob (F-statistic): 1.33e-115
Time: 17:18:45 Log-Likelihood: -491.24
No. Observations: 498 AIC: 986.5
Df Residuals: 496 BIC: 994.9
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
intercept -0.0007 0.037 -0.020 0.984 -0.073 0.072
eps_fc 0.8680 0.029 30.448 0.000 0.812 0.924
Omnibus: 366.690 Durbin-Watson: 1.978
Prob(Omnibus): 0.000 Jarque-Bera (JB): 163395.963
Skew: 1.937 Prob(JB): 0.00
Kurtosis: 91.654 Cond. No. 2.17


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Isolate Quarterly EPS data by firm to fit OLS model

In [138]:
avgs_quarter_ols = separate_eps_fc_act(avgs_quarter, ['firm_id', 'average_type', 'time_period'], 'average')
avgs_quarter_ols.sample(10)
Out[138]:
firm_id average_type time_period eps_fc eps_act intercept
361 90 quarterly q2 0.988667 0.794286 1
188 47 quarterly q1 0.319429 0.333810 1
814 203 quarterly q3 0.770619 0.614033 1
784 196 quarterly q1 0.789333 0.911111 1
1839 459 quarterly q4 0.747714 0.405000 1
920 230 quarterly q1 0.892333 0.908095 1
311 77 quarterly q4 0.458857 0.190414 1
1623 405 quarterly q4 0.410476 0.381833 1
676 169 quarterly q1 0.413143 0.102143 1
1561 390 quarterly q2 0.594810 0.444464 1
In [139]:
ax = sb.lmplot(data= avgs_quarter_ols, x = 'eps_fc', y = 'eps_act', height = 10, 
          scatter_kws = {'color' : 'black'}, line_kws = {'color' : 'orchid'})

plt.title('Actual vs. Forecasted EPS', size = 20)
plt.suptitle('as Quarterly Averages', size = 17)
plt.xlabel('Forecasted EPS', size = 15)
plt.ylabel('Actual EPS', size = 15)

plt.savefig(PATH_BIVARIATE + 'avgs-act-fc-quarter.png')

Observation 1: Unlike the previous graph showing the twenty-year averages of forecasted and actual EPS, the 95% confidence interval for quarterly averages is much more narrow.

Observation 2: There is a similar pattern of all points being "clustered" at the beginning, with outliers being abundant outside of the initial x-range. Here, that range is from x = -1 to x = 3.5.

Observation 3: Outliers among EPS quarterly averages are much more varied and pronounced, which causes the regression line to be much less steep than the regression line for twenty-year EPS averages.

In [140]:
quarter_ols = implement_ols(avgs_quarter_ols, 'eps_act', 'eps_fc', 'Quarterly')
quarter_ols[0]
Out[140]:
OLS Regression Results
Dep. Variable: eps_act R-squared: 0.394
Model: OLS Adj. R-squared: 0.394
Method: Least Squares F-statistic: 1291.
Date: Mon, 04 May 2020 Prob (F-statistic): 3.26e-218
Time: 17:18:47 Log-Likelihood: -3163.6
No. Observations: 1988 AIC: 6331.
Df Residuals: 1986 BIC: 6342.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
intercept -0.0228 0.033 -0.684 0.494 -0.088 0.042
eps_fc 0.8937 0.025 35.927 0.000 0.845 0.943
Omnibus: 3432.956 Durbin-Watson: 1.869
Prob(Omnibus): 0.000 Jarque-Bera (JB): 15378611.603
Skew: 10.939 Prob(JB): 0.00
Kurtosis: 433.324 Cond. No. 2.13


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Isolate Yearly EPS data by firm to fit OLS model

In [141]:
avgs_year_ols = separate_eps_fc_act(avgs_year, ['firm_id', 'average_type', 'time_period'], 'average')
avgs_year_ols.sample(10)
Out[141]:
firm_id average_type time_period eps_fc eps_act intercept
3138 149 yearly 2008 0.69800 0.653333 1
10177 484 yearly 2012 0.82525 0.850000 1
8541 406 yearly 2014 1.68875 1.797500 1
9739 463 yearly 2015 0.79075 0.795000 1
4638 220 yearly 2017 1.66575 1.522500 1
6583 313 yearly 2009 -0.08175 -0.725000 1
3233 153 yearly 2019 0.26325 0.243333 1
7288 347 yearly 2000 0.12450 0.112500 1
10083 480 yearly 2002 NaN NaN 1
8099 385 yearly 2013 1.43250 1.515000 1
In [142]:
ax = sb.lmplot(data= avgs_year_ols, x = 'eps_fc', y = 'eps_act', height = 10, 
          scatter_kws = {'color' : 'black'}, line_kws = {'color' : 'orchid'})

plt.title('Actual vs. Forecasted EPS', size = 20)
plt.suptitle('as Yearly Averages', size = 17)
plt.xlabel('Forecasted EPS', size = 15)
plt.ylabel('Actual EPS', size = 15)

plt.savefig(PATH_BIVARIATE + 'avgs-act-fc-year.png')

Observation 1: There is significantly less variance among yearly EPS averages, with only 2 outliers at x = -4 and x = 0.5.

Observation 2: Compared to twenty-year and quarterly EPS averages, the 95% confidence interval for yearly EPS averages is much more narrow than both of them.

Observation 3: Residuals tend to stay closer to the regression line, implying that there is lower bias in the relationship between forecasted and actual yearly EPS averages.

In [143]:
year_ols = implement_ols(avgs_year_ols, 'eps_act', 'eps_fc', 'Yearly')
year_ols[0]
Out[143]:
OLS Regression Results
Dep. Variable: eps_act R-squared: 0.262
Model: OLS Adj. R-squared: 0.262
Method: Least Squares F-statistic: 3147.
Date: Mon, 04 May 2020 Prob (F-statistic): 0.00
Time: 17:18:51 Log-Likelihood: -21086.
No. Observations: 8848 AIC: 4.218e+04
Df Residuals: 8846 BIC: 4.219e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
intercept -0.1440 0.032 -4.530 0.000 -0.206 -0.082
eps_fc 1.0824 0.019 56.097 0.000 1.045 1.120
Omnibus: 17981.094 Durbin-Watson: 1.714
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5446459294.082
Skew: -15.036 Prob(JB): 0.00
Kurtosis: 3846.500 Cond. No. 2.09


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Isolate Raw EPS data by firm to fit OLS model

In [144]:
features_eps_fc_act = features.loc[features['feature'].isin(['eps_act', 'eps_fc'])]
features_all_ols = separate_eps_fc_act(features_eps_fc_act, ['firm_id', 'term'], 'value')
features_all_ols.sample(10)
Out[144]:
firm_id term eps_fc eps_act intercept
4008 47 2014Q1 0.482 0.500 1
23472 279 2008Q1 0.626 0.650 1
35966 428 2002Q3 NaN NaN 1
22267 265 2000Q4 NaN 0.200 1
22460 267 2007Q1 0.456 0.490 1
38112 453 2014Q1 0.022 0.030 1
40172 478 2004Q1 0.347 0.340 1
30198 359 2009Q3 0.316 0.270 1
41087 489 2001Q4 0.460 0.450 1
31948 380 2006Q1 0.543 0.555 1
In [145]:
ax = sb.lmplot(data= features_all_ols, x = 'eps_fc', y = 'eps_act', height = 10, 
          scatter_kws = {'color' : 'black'}, line_kws = {'color' : 'orchid'})

plt.suptitle('Actual vs. Forecasted EPS', size = 20)
plt.xlabel('Forecasted EPS', size = 15)
plt.ylabel('Actual EPS', size = 15)

plt.savefig(PATH_BIVARIATE + 'features-act-fc-all.png')

Observation 1: The raw actual vs. forecasted EPS help paint a better overall picture of the trends.

Observation 2: The previous graphs depicting averages instead of raw data have clustered points at the beginning of the x-axis. But here, the scatterplot starts with 4 outliers until the first "cluster" of points takes place.

Observation 3: From x = 0 onward, the initial cluster of points gradually declusters afterwards while maintaining minimal distance from the regression line as residuals. Out of all the regression plots depicting actual vs. forecasted EPS, the raw data displays the least amount of variance. This potentially could mean the lowest amount of bias as well.

Observation 4: The slope for raw EPS data is less steep than all of the three previous graphs, and this relationship depicts a slight positive linear relationship: the weakest linear trend out of them all.

Observation 1: Interestingly enough, the graph encompassing the broadest data (all data from features) has the most stable 95% confidence interval as per the regression line.

Observation 2: The quarterly average regression line is the most unstable.

In [146]:
features_all_ols.head()
Out[146]:
firm_id term eps_fc eps_act intercept
0 0 1999Q1 NaN 0.16 1
1 0 1999Q2 NaN 0.35 1
2 0 1999Q3 NaN 0.30 1
3 0 1999Q4 NaN 0.32 1
4 0 2000Q1 NaN 0.29 1
In [147]:
all_ols = implement_ols(features_all_ols, 'eps_act', 'eps_fc', 'All')
all_ols[0]
Out[147]:
OLS Regression Results
Dep. Variable: eps_act R-squared: 0.129
Model: OLS Adj. R-squared: 0.129
Method: Least Squares F-statistic: 5115.
Date: Mon, 04 May 2020 Prob (F-statistic): 0.00
Time: 17:19:04 Log-Likelihood: -99603.
No. Observations: 34448 AIC: 1.992e+05
Df Residuals: 34446 BIC: 1.992e+05
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
intercept -0.1697 0.027 -6.399 0.000 -0.222 -0.118
eps_fc 1.1196 0.016 71.516 0.000 1.089 1.150
Omnibus: 106243.709 Durbin-Watson: 1.663
Prob(Omnibus): 0.000 Jarque-Bera (JB): 299298412516.719
Skew: 45.670 Prob(JB): 0.00
Kurtosis: 14442.998 Cond. No. 2.10


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Just to get a comprehensive view of all linear regression tests in 1 spot, I will display their p-value, coefficient, and r-value.

One-Tailed Test (Hypothesis Testing):

Before implementing any linear regression models, I propose the following null $H_0$ and alternative $H_1$ hypotheses:

$$H_0: Actual EPS - Forecasted EPS \leq 0$$$$H_1: Actual EPS - Forecasted EPS > 0$$

This hypothesis test uses a one-tailed approach to predict the likelihood of finding a value which lies in the extreme in favor of the alternative hypothesis. Our response variable is a difference in mean prediction error. The alternative hypothesis states that the difference in actual and forecasted EPS must be greater than 0, implying that actual EPS, on average, is much higher than its corresponding forecasted EPS.

The linear regression test provided by statsmodel implements an Ordinary-Least-Squares (OLS) method to calculate all the necessary parameters in our linear regression models: the p-value, coefficient (slope), and r-value (correlation coefficient).

Error Threshold

In order to give a point of comparison for our interpretations of all upcoming p-values, I decided to define a Type I error threshold of 0.05. The Type I error rate is often denoted by the $\alpha$ level, which is the probability of a Type I error given that the null hypothesis is true. [source]. A Type I error is defined as a false positive error, where you incorrectly reject a true null hypothesis. [source]. This means that you report your findings as significant when in fact, the variables have occurred solely by chance.

A Type II error is defined as a false negative, where you as the researcher fail to reject a false null hypothesis. This means that you report that there is no statistical significance when in fact, there really is.

Both Type I and II errors are inversely proportional; the more likely the probability of a Type I error will occur, the less likely a Type II error will occur, and vice versa.

Type I errors are known to be the "worst" type of errors--false positives lead researchers to arrive to conclusions that waste time, resources, and drawing connections between variables when really there are none. On the other hand, Type II errors usually lead to the preservation of the status quo [source]. Between these two errors, Type I leads to much more damage in the long-run, which is why I have established a Type I error threshold of 0.05 to compare against the following p-values.

In [148]:
ols_arr = [twenty_ols, quarter_ols, year_ols, all_ols]
In [149]:
statement = 'Actual EPS vs. Forecasted {} EPS:\np-value: {:.5f}\ncoefficient: {:.5f}\nr-value: {:.5f}\n'
In [150]:
for ol in ols_arr:
    print(statement.format(ol[1], ol[2], ol[3], ol[4]))
Actual EPS vs. Forecasted Twenty-Year EPS:
p-value: 0.00000
coefficient: 0.86799
r-value: 0.80713

Actual EPS vs. Forecasted Quarterly EPS:
p-value: 0.00000
coefficient: 0.89373
r-value: 0.62762

Actual EPS vs. Forecasted Yearly EPS:
p-value: 0.00000
coefficient: 1.08243
r-value: 0.51224

Actual EPS vs. Forecasted All EPS:
p-value: 0.00000
coefficient: 1.11957
r-value: 0.35956

Observation 1: Let's establish a Type I error threshold of 0.05, all actual EPS vs. any forecasted EPS type all depict statistically significant relationships, since all p-values are 0.000 and are less than the Type I error threshold.

Observation 2: All relationships are moderately positively correlated. For every 1 increase in the type of eps_fc, the expected actual EPS should increase by the coefficient, or slope. For example, for Actual EPS vs. Forecasted Quarterly EPS, for every 1 increase in quarterly eps_fc, then we should expect a 0.89373 increase for eod_act.

Let's look at the r-values (correlation coefficients) and what their values imply: source

  • Exactly -1: A perfect downhill (negative) linear relationship
  • -0.70: A strong downhill (negative) linear relationship
  • -0.50: A moderate downhill (negative) linear relationship
  • -0.30: A weak downhill (negative) linear relationship
  • 0: no linear relationship
  • +0.30 : A weak uphill (positive) linear relationship
  • +0.50 : A moderate uphill (positive) linear relationship
  • +0.70 : A strong uphill (positive) linear relationship

Observation 3: All r-values are positive, varying between 0.3 and 0.8. Here is a breakdown of each relationship and their correlation strength as defined by the correlation coefficient r:

  • Actual EPS vs. Forecasted Twenty-Year EPS: strong positive relationship
  • Actual EPS vs. Forecasted Quarterly EPS: moderate positive relationship
  • Actual EPS vs. Forecasted Yearly EPS: moderate positive relationship
  • Actual EPS vs. Forecasted All EPS: weak positive relationship

Observation 4:Notice how the full data, in its entirety, depicts the weakest positive relationship between actual and forecasted EPS. This intuitively makes sense; trends tend to become more unpredictable over longer spans of time, which give rise to outliers that weaken the overall direction of the distribution.

Because all p-values are less than 0.05, this means the explanatory variable eps_act is statistically significant in predicting any given EPS forecast. This is evidence in favor of my alternative hypothesis, therefore we reject the null hypothesis in favor of the alternative that there is no difference between the means of actual and all forecasted EPS.

"Dumb Predictions" vs "Expert Predictions"

1) Create a DF depicting eps_act last quarter vs eps_act this quarter.

2) Create "dumb predictions"

3) Compare my predictions to all eps_fc values

In [151]:
#isolate eps_act entries, drop date column
features_dumb_eps = features[features['feature'] == 'eps_act'].loc[:, features.columns != 'date']
In [152]:
#grab previous eps_act term
features_dumb_eps['previous_value'] = features_dumb_eps['value'].shift(1)

features_dumb_eps = features_dumb_eps[['firm_id', 'feature','term', 'value', 'previous_value']]
In [153]:
features_dumb_eps.head()
Out[153]:
firm_id feature term value previous_value
125240 0 eps_act 1999Q1 0.16 NaN
125241 0 eps_act 1999Q2 0.35 0.16
125242 0 eps_act 1999Q3 0.30 0.35
125243 0 eps_act 1999Q4 0.32 0.30
125244 0 eps_act 2000Q1 0.29 0.32

"Dumb Prediction" = the moving average of the values from the last 2 quarters.

In [154]:
#create dumb prediction
features_dumb_eps['dumb_prediction'] = features_dumb_eps.value.rolling(2).mean()
In [155]:
#preview
features_dumb_eps.head(5)
Out[155]:
firm_id feature term value previous_value dumb_prediction
125240 0 eps_act 1999Q1 0.16 NaN NaN
125241 0 eps_act 1999Q2 0.35 0.16 0.255
125242 0 eps_act 1999Q3 0.30 0.35 0.325
125243 0 eps_act 1999Q4 0.32 0.30 0.310
125244 0 eps_act 2000Q1 0.29 0.32 0.305
In [156]:
#create column showing corresponding eps_fc for that term

#create Series of eps_fc values
features_fc = features[features.feature == 'eps_fc'].drop(columns = ['date', 'feature'])
In [157]:
features_fc.head()
Out[157]:
firm_id term value
82820 0 1999Q1 NaN
82821 0 1999Q2 NaN
82822 0 1999Q3 NaN
82823 0 1999Q4 NaN
82824 0 2000Q1 NaN
In [158]:
#merge eps_fc values with dumb predictions on 'firm_id' and 'term'
features_dumb_eps = features_dumb_eps.merge(features_fc, on = ['term', 'firm_id'], how = 'left')
features_dumb_eps.rename(columns = {'value_x' : 'value', 'value_y' : 'eps_fc_value'}, inplace = True)
In [159]:
#create 3-year moving averages for all numerical fields
features_dumb_eps['ma_value'] = features_dumb_eps.value.rolling(12).mean()
features_dumb_eps['ma_previous_value'] = features_dumb_eps.previous_value.rolling(12).mean()
features_dumb_eps['ma_dumb_prediction'] = features_dumb_eps.dumb_prediction.rolling(12).mean()
features_dumb_eps['ma_eps_fc_value'] = features_dumb_eps.eps_fc_value.rolling(12).mean()
In [160]:
features_dumb_eps.sample(5)
Out[160]:
firm_id feature term value previous_value dumb_prediction eps_fc_value ma_value ma_previous_value ma_dumb_prediction ma_eps_fc_value
17010 202 eps_act 2009Q3 0.23 0.25 0.240 0.201 0.439167 0.459167 0.449167 0.452667
15173 180 eps_act 2012Q2 0.19 0.17 0.180 0.186 0.130833 0.121042 0.125938 0.129417
31734 377 eps_act 2015Q3 1.93 1.92 1.925 1.784 1.816667 1.794167 1.805417 1.669500
16295 193 eps_act 2019Q4 NaN 0.69 NaN 1.543 NaN 0.955000 NaN 1.172417
41265 491 eps_act 2004Q2 NaN NaN NaN NaN NaN NaN NaN NaN
In [161]:
features_dumb_eps.head()
Out[161]:
firm_id feature term value previous_value dumb_prediction eps_fc_value ma_value ma_previous_value ma_dumb_prediction ma_eps_fc_value
0 0 eps_act 1999Q1 0.16 NaN NaN NaN NaN NaN NaN NaN
1 0 eps_act 1999Q2 0.35 0.16 0.255 NaN NaN NaN NaN NaN
2 0 eps_act 1999Q3 0.30 0.35 0.325 NaN NaN NaN NaN NaN
3 0 eps_act 1999Q4 0.32 0.30 0.310 NaN NaN NaN NaN NaN
4 0 eps_act 2000Q1 0.29 0.32 0.305 NaN NaN NaN NaN NaN
In [162]:
fig = plt.figure(figsize = [10, 10])

sb.scatterplot(data = features_dumb_eps, x = 'dumb_prediction', y = 'value', color = 'green')
sb.scatterplot(data = features_dumb_eps, x = 'eps_fc_value', y = 'value', color = 'purple')

fig.legend(labels = ['My Forecasted Eps', 'Forecasted EPS'], prop = {'size' : 15}, loc = 'lower right')

plt.suptitle('Personal EPS Forecasts vs. Actual EPS', size = 20, y = .94)
# plt.title('in 3-year Moving Averages', size = 17)
plt.ylabel('EPS Value')
plt.xlabel('Actual EPS')

plt.savefig(PATH_MULTIVARIATE + 'features-dumb-act-fc-scatter.png')
plt.show();

Observation 1: With actual EPS as the x-value, we can see that my personal forecasts show much more variance with outliers spreading across the entire scatterplot, in all 4 quartiles. If we ignored the outliers, then all of my personal forecasts cluster around (0, 0).

Observation 2: Similarly, experts' forecasted EPS has only 3 outliers and generally stays uniformly clustered around (0,0). This means that my approach of generating personal forecasts by 2-quarter averages is much less accurate than the method that Bloomberg forecasters had used.

To confirm variance and bias, I decided to add two separate regression lines on the same scatterplot:

In [163]:
fig = plt.figure(figsize = [10, 10])

sb.regplot(data = features_dumb_eps, x = 'dumb_prediction', y = 'value', color = 'green')
sb.regplot(data = features_dumb_eps, x = 'eps_fc_value', y = 'value', color = 'purple')

fig.legend(labels = ['My Forecasted EPS', 'Forecasted EPS'], prop = {'size' : 15}, loc = 'lower right')

plt.suptitle('Forecasted EPS vs. Actual EPS', size = 20, y = .94)
# plt.title('in 3-year Moving Averages', size = 17)
plt.ylabel('EPS')
plt.xlabel('Actual EPS')

plt.savefig(PATH_MULTIVARIATE + 'features-dumb-act-fc-reg.png')
plt.show();

Observation 1: Not surprisingly, my personal forecasted EPS has a much wider 95% confidence interval than experts' forecasted EPS. Due to the wider variance for the former, those 2-quarter rolling averages would subsequently come equipped with a broader confidence interval scope.

Observation 2: Experts' forecasted EPS, while counting outliers, exhibits a greater rate of increase compared to my personal forecasts after x = 0. This means that when compared to actual EPS values, experts' forecasts tend to surpass my personal forecasts.

Observation 3: When ignoring outliers, experts' forecasted EPS and my personal forecasts clusters uniformly around (0, 0), with slightly more of my personal forecasts taking up the left side of the "cluster:" before x = 0. This confirms my previous statement that although both variables show a general consistent positive linear trend, my personal forecasts are less reliable and more inaccurate due to having more variance, outliers insonsistent with the pairing regression line trend, and moving-averages depend solely on historical data.

Observation 4: Most of my personally generated outliers reside in Quartile I and the x-axis side of Quartile IV. All experts' forecasted EPS outliers reside in Quartiles II and III. Both forecasts' outliers are scattered in completely different quartiles from each other. This observation helps to further accentuate that although both regression lines share a linear positive trend, the distribution of outliers shows that both variables are not necessarily correlated to each other in terms of actual EPS.

In [164]:
fig = plt.figure(figsize = [10, 10])

sb.scatterplot(data = features_dumb_eps, x = 'dumb_prediction', y = 'value', color = 'purple')

plt.suptitle('Actual EPS vs. My Predicted EPS', size = 20, y = .94)
plt.ylabel('Actual EPS')
plt.xlabel('My Predicted EPS')

plt.savefig(PATH_MULTIVARIATE + 'features-dumb-act-scatter.png')
plt.show();
In [165]:
fig = plt.figure(figsize = [10, 10])

sb.regplot(data = features_dumb_eps, x = 'dumb_prediction', y = 'value', color = 'purple')

plt.suptitle('Actual EPS vs. My Predicted EPS', size = 20, y = .94)
plt.ylabel('Actual EPS')
plt.xlabel('My Predicted EPS')

plt.savefig(PATH_MULTIVARIATE + 'features-dumb-act-reg.png')
plt.show();

One-Tailed Test (Hypothesis Testing):

I propose the following null $H_0$ and alternative $H_1$ hypotheses:

$$H_0: Actual EPS - My Forecasted EPS \leq 0$$$$H_1: Actual EPS - My Forecasted EPS > 0$$

This hypothesis test also uses a one-tailed approach to predict the likelihood of finding a value which lies in the extreme in favor of the alternative hypothesis.

In [166]:
features_dumb_eps_ols = features_dumb_eps.copy()
features_dumb_eps_ols['intercept'] = 1
In [167]:
lm_dumb_act = sm.OLS(features_dumb_eps_ols['value'], features_dumb_eps_ols[['intercept', 'dumb_prediction']], missing = 'drop')
result_dumb_act = lm_dumb_act.fit()

result_dumb_act.summary()
Out[167]:
OLS Regression Results
Dep. Variable: value R-squared: 0.609
Model: OLS Adj. R-squared: 0.609
Method: Least Squares F-statistic: 5.666e+04
Date: Mon, 04 May 2020 Prob (F-statistic): 0.00
Time: 17:19:19 Log-Likelihood: -92270.
No. Observations: 36357 AIC: 1.845e+05
Df Residuals: 36355 BIC: 1.846e+05
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
intercept 0.0064 0.016 0.392 0.695 -0.026 0.038
dumb_prediction 1.0026 0.004 238.039 0.000 0.994 1.011
Omnibus: 50305.992 Durbin-Watson: 2.889
Prob(Omnibus): 0.000 Jarque-Bera (JB): 65797589437.202
Skew: 5.856 Prob(JB): 0.00
Kurtosis: 6593.463 Cond. No. 3.95


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [168]:
r = math.sqrt(result_dumb_act.rsquared)
#display r value
r
Out[168]:
0.780486843818858

Observation 1: Let's establish a Type I error threshold of 0.05. Here, actual EPS vs. my forecasted EPS depicts a statistically significant relationship, because p-value = 0, which is less than our Type I error threshold.

Observation 2: The slope associated with my dumb_prediction is 1.0026, which depicts a positive linear relationship. For every 1 increase in my dumb_prediction, then we should expect a 1.0026 increase in eod_act.

Observation 3: The correlation coefficient r = 0.78. This is between 0.70 and 1, which means that actual EPS vs. my forecasted EPS depicts a strong uphill (positive) linear relationship.

Because the p-value is less than 0.05, this means the explanatory variable eps_act is statistically significant in predicting my dumb_prediction personal EPS forecasts. This is evidence in favor of my alternative hypothesis; therefore we reject the null hypothesis in favor of the alternative that there is no difference between the means of actual EPS and my own forecasted EPS.

V) Multivariate Exploration

"Dumb Predictions" vs. "Expert Predictions"

In [169]:
value_vars = ['value', 'previous_value', 'dumb_prediction', 'eps_fc_value', 'ma_value', 'ma_previous_value',
             'ma_dumb_prediction', 'ma_eps_fc_value']
In [170]:
#melt dumb_predictions and eps_fc_value to prepare for plotting
df_dumb_eps_melt = pd.melt(features_dumb_eps, id_vars = ['firm_id', 'feature', 'term'],
       value_vars = value_vars,
       var_name = 'value_type')
In [171]:
#isolate only dumb predictions and eps_fc_value
dumb_eps_fc = df_dumb_eps_melt[df_dumb_eps_melt.value_type.isin(['dumb_prediction', 'eps_fc_value', 'value'])]
In [172]:
dumb_eps_fc['ma_value'] = dumb_eps_fc.value.rolling(12).mean()
D:\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
In [173]:
dumb_eps_fc.sample(5)
Out[173]:
firm_id feature term value_type value ma_value
13196 157 eps_act 2001Q1 value 0.175 NaN
14000 166 eps_act 2013Q1 value 3.010 0.788333
41113 489 eps_act 2008Q2 value 0.720 0.674167
127704 5 eps_act 2005Q1 eps_fc_value NaN NaN
102461 209 eps_act 2015Q2 dumb_prediction 5.075 4.826327
In [174]:
plt.figure(figsize = [20, 10])
ax = sb.pointplot(x = 'term', y = 'value', data = dumb_eps_fc, hue = 'value_type',
                  errwidth = .5, palette = 'Set2')

plt.xticks(rotation = 'vertical')

leg_handles = ax.get_legend_handles_labels()[0]
ax.legend(leg_handles, ['Actual EPS', 'My Forecasted EPS', 'Forecasted EPS'], prop = {'size' : 20})

plt.title('Average Forecasted EPS and My Predicted EPS by Term', size = 20)
plt.ylabel('Average EPS Value')
plt.xlabel('Term')

plt.savefig(PATH_MULTIVARIATE + 'features-dumb-eps.png')
plt.show();
In [175]:
#isolate actual EPS and personal predicted EPS
ma_dumb_fc = df_dumb_eps_melt[df_dumb_eps_melt.value_type.isin(['ma_dumb_prediction', 'ma_value', 'ma_eps_fc_value'])]
In [176]:
ma_dumb_fc.value_type.unique()
Out[176]:
array(['ma_value', 'ma_dumb_prediction', 'ma_eps_fc_value'], dtype=object)
In [177]:
plt.figure(figsize = [20, 10])
ax = sb.pointplot(x = 'term', y = 'value', data = ma_dumb_fc, hue = 'value_type',
                  errwidth = .5, palette = "Set2")

plt.xticks(rotation = 'vertical')

leg_handles = ax.get_legend_handles_labels()[0]
ax.legend(leg_handles, ['Actual EPS', 'My Forecasted EPS', 'Forecasted EPS'], prop = {'size' : 20})

plt.suptitle('Average Forecasted EPS and My Predicted EPS by Term', size = 20, y = .95)
plt.title('in 3-Year Moving Averages from 1999 - 2019', size = 17)
plt.ylabel('Moving Average EPS Value')
plt.xlabel('Term')

plt.savefig(PATH_MULTIVARIATE + 'features-dumb-eps-ma.png')
plt.show();

Observation 1: In the first figure, my predicted EPS has strayed drastically away from the more normal actual EPS general trend pattern in 2 instances. Between 2000Q4 and 2001Q1, my forecasts become optimistic, forming a 2-point peak before jumping back into the uniform trend. Between 2008Q4 and 2009Q1, my forecasts become pessimistic, forming a 2-point trough before hopping back up into the uniform trend similar to the actual EPS trend.

Observation 2: When plotting moving averages per term, both trend lines follow an almost-identical polynomial trend, overall leading into a direction of a positive slope.

The technique that the Bloomberg forecasters used was much more effective than my method of predicting by using 2-quarter moving averages.

Time vs. Percentage Error vs. Firm IDs

In [178]:
#add percentage_error column to features
features_pct = features[features.feature.isin(['eps_fc', 'eps_act'])]
In [179]:
features_pct = separate_eps_fc_act(features_pct, ['firm_id', 'term'], 'value')
In [180]:
features_pct['pct_error'] = (features_pct.eps_act - features_pct.eps_fc) / features_pct.eps_act
In [181]:
features_pct['pct_error'] = features_pct['pct_error'].replace([np.inf, -np.inf], np.nan)
In [182]:
#get absolute value of percentage errors
features_pct['pct_error_abs'] = features_pct.pct_error.abs()
In [183]:
#add ticker column for plotting assisance
features_pct['firm_ticker'] = convert_ids_to_tickers(features_pct.firm_id)
In [184]:
#add 3-year moving averages for percentage error
features_pct['ma_pct_error'] = features_pct.pct_error.rolling(12).mean()
In [185]:
#grab top/bottom 10 firms
features_pct_top_ids = features_pct.sort_values(by = 'pct_error_abs', ascending = False).firm_id.drop_duplicates().head(5).values
features_pct_bottom_ids = features_pct.sort_values(by= 'pct_error_abs', ascending = True).firm_id.drop_duplicates().head(5).values
In [186]:
#filter DF for those 10 firms
features_pct_top = features_pct[features_pct.firm_id.isin(features_pct_top_ids)]
features_pct_bottom = features_pct[features_pct.firm_id.isin(features_pct_bottom_ids)]

Percentage Error by Year

In [187]:
plt.figure(figsize = [20, 15])
ax = sb.pointplot(data = features_pct_top, x = features_pct_top.term.dt.year, y = 'pct_error',
                  hue = 'firm_ticker', dodge = True, legend_out = False, errwidth = 0.5)


plt.xlabel('Year')
plt.ylabel('Average Percentage Error')
plt.title('Average Percentage Error by Year (for the Top 5 Most Inaccurate Firms)', size = 20)

plt.savefig(PATH_MULTIVARIATE + 'pct-firm-year-top.png')
plt.show()

The above visual isn't great for analysis. Let's look at a strip plot distribution instead.

In [188]:
plt.figure(figsize = [20, 10])
sb.stripplot(data = features_pct_top, x = features_pct_top.term.dt.year, y = 'pct_error', hue = 'firm_ticker')
plt.xticks(rotation = 'vertical')

plt.suptitle('Distribution of EPS Percentage Error by Year', size = 20, y = .94)
plt.title('for the Top 5 Most Inaccurate Firms', size = 17)

plt.xlabel('Year')
plt.ylabel('Percentage Error')

plt.savefig(PATH_MULTIVARIATE + 'pct-firm-year-top-strip.png')

Observation 1: The most notable outlier is IRM, with a percentage error of over -1000 for EPS in the year 2014.

Observation 2: BIM, IRM, MCK, PXD, and QRVO are the five most inaccurately predicted firms in terms of percentage error by year.

Observation 3: All other firms beside IRM share large distance gaps from 2014 - 2018 with IRM in 2014.

Observation 4: Outliers only started showing up in distributions starting on the year 2014. Therefore, this means that EPS forecasts have become more inaccurate in the more recent years, starting from 2014.

Percentage Error by Quarter

In [189]:
plt.figure(figsize = [10, 10])
ax = sb.stripplot(data = features_pct_top, x = features_pct_top.term.dt.quarter, y = 'pct_error',
                  hue = 'firm_ticker', jitter = True)

plt.suptitle('Distribution of EPS Percentage Error by Quarter', size = 15, y = .94)
plt.title('for the Top 5 Most Inaccurate Firms', size = 13)
plt.ylabel('Average Percentage Error', size = 12)
plt.xlabel('Quarter', size = 12)
plt.legend(title = 'Firm Ticker')

plt.savefig(PATH_MULTIVARIATE + 'pct-firm-quarter-top-strip.png')

Observation 1: The most notable outlier is IRM yet again, with a percentage error of over -1000 for EPS in Quarter 3.

Observation 2: IBM, IRM, MCK, PXD, and QRVO are the five most inaccurately predicted firms in terms of percentage error by quarter. This list is the same as the previous stripplot and pointplot depicting years.

Observation 3: Quarter 4 is the only quarter with no outliers. This makes intuitive sense because as the quarters in any singular year pass and familiarity with the current fiscal year builds, the more accurate forecasted EPS would be. Let us test this theory.

In [190]:
plt.figure(figsize = [20, 15])
ax = sb.pointplot(data = features_pct_top, x = features_pct_top.term.dt.quarter, y = 'pct_error', hue = 'firm_ticker', dodge = True,
                  legend_out = False, errwidth = 0.5)

plt.xlabel('Quarter', size = 17)
plt.ylabel('Average Percentage Error', size = 17)
plt.title('Average Percentage Error by Quarter (for the Top 5 Most Inaccurate Firms)', size = 20)
plt.xticks(size = 15)
plt.legend(title = 'Firm Ticker')

plt.savefig(PATH_MULTIVARIATE + 'pct-firm-quarter.png')
plt.show()

Observation 1: The firms IRM and IBM deviate greatly in Quarter 3, where forecasters' predictions veer away from zero. Meanwhile, the other firms cluster more closely around 0. The most inaccurate predictions, on average, happen to be made during Quarter 3 in any given year.

Observation 2: The previous observation disproves my initial intuition that forecasters' predictions would become more accurate by the quarter.

Percentage Error by Year and Quarter

In [191]:
plt.figure(figsize = [22, 15])
ax = sb.pointplot(data = features_pct_top, x = 'term', y = 'pct_error', hue = 'firm_ticker', dodge = True,
                  legend_out = False)

plt.xlabel('Term')
plt.ylabel('Percentage Error')
plt.title('Percentage Error by Year and Quarter (for the Top 5 Most Inaccurate Firms)', size = 20)
plt.xticks(rotation = 'vertical')
plt.legend(title = 'Firm Ticker')

plt.savefig(PATH_MULTIVARIATE + 'pct-firm-term.png')
plt.show()

This isn't a very good visualization either. Let's look at a stripplot instead:

In [192]:
plt.figure(figsize = [20, 10])
sb.stripplot(data = features_pct_top, x = 'term', y = 'pct_error', hue = 'firm_ticker')

plt.suptitle('Distribution of EPS Percentage Error by Term', size = 20, y = .94)
plt.title('for the top 5 Most Inaccurate Firms', size = 17)

plt.xlabel('Term', size = 17)
plt.ylabel('Percentage Error', size = 17)
plt.xticks(rotation = 'vertical')
plt.legend(title = 'Firm Ticker')

plt.savefig(PATH_MULTIVARIATE + 'pct-ferm-tirm-strip.png')
plt.show();

Observation 1: The most notable outlier is IRM, with a percentage error of over -1000 for EPS in the term 2014Q3.

Observation 2: BIM, IRM, MCK, PXD, and QRVO are the five most inaccurately predicted firms in terms of percentage error by term. This list is consistent for the previous plots feature standalone quarters and years.

Observation 3: outliers start after 2014Q3. In the next terms, outliers tend to year in Q1 and Q3 exclusively; much more Q3.

Observation 4: All other firms beside IRM share large distance gaps from 2014Q3 - 2018Q3 from IRM in 2014Q3.

Observation 5: Outliers only started showing up in distributions starting in the year 2014. Therefore, this means that EPS forecasts have become more inaccurate in the more recent terms, starting from 2014Q3. This conclusion is consistent with the observation made in the quarterly and yearly stripplots & pointplots.

Prediction Error by Term by Top 5 Firms

In [193]:
features_pct['difference'] = features_pct.eps_act - features_pct.eps_fc
In [194]:
features_pct['difference_abs'] = features_pct.difference.abs()
In [195]:
diffs_top_ids = features_pct.sort_values(by = 'difference_abs', ascending = False).firm_id.drop_duplicates().head(5).values
diffs_bottom_ids = features_pct.sort_values(by = 'difference_abs', ascending = True).firm_id.drop_duplicates().head(5).values
In [196]:
diffs_top = features_pct[features_pct.firm_id.isin(diffs_top_ids)]
diffs_bottom = features_pct[features_pct.firm_id.isin(diffs_bottom_ids)]

Prediction Error by Term

In [197]:
plt.figure(figsize = [20, 10])
sb.pointplot(data = diffs_top, x = diffs_top.term, y = 'difference', hue = 'firm_ticker', errwidth = 0.5)

plt.legend(title = 'Firm Ticker')
plt.xticks(rotation = 'vertical')
plt.xlabel('Term')
plt.ylabel('Average Prediction Error')
plt.suptitle('Average EPS Prediction Error by Term', size = 20, y = .94)
plt.title('for the Top 5 Most Inaccurate Firms', size = 17)

plt.savefig(PATH_MULTIVARIATE + 'features-perror-term-firm.png')
plt.show();

Observation 1: The 2 most noticeable outliers are the firm AIG in 2008Q4 (optimistic forecasts), and LRCX in 2000Q3 (pessimistic forecasts).

Observation 2: The 5 most inaccurate firms by prediction error are AGN, AIG, CHTR, LRCX, and VRSN.

Observation 3: VRSN is the most stable of the top 5 most inaccurate firms; its average prediction error by term hugs closely to a slopeless y = 0 trend.

Prediction Error by Year

In [198]:
plt.figure(figsize = [20, 10])
sb.pointplot(data = diffs_top, x = diffs_top.term.dt.year, y = 'difference', hue = 'firm_ticker',errwidth = 0.5)

plt.legend(title = 'Firm Ticker')
plt.xticks(rotation = 'vertical')
plt.xlabel('Year')
plt.ylabel('Prediction Error')
plt.suptitle('Average EPS Prediction Error by Year', size = 20, y = .94)
plt.title('for the Top 5 Most Inaccurate Firms', size = 17)

plt.savefig(PATH_MULTIVARIATE + 'features-perror-year-firm.png')
plt.show();

Observation 1: Yet again, the 2 most noticeable outliers are the firm AIG in 2008 (optimistic forecasts), and LRCX in 2000 (pessimistic forecasts).

Observation 2: The 5 most inaccurate firms by prediction error are AGN, AIG, CHTR, LRCX, and VRSN. The same list as before.

Observation 3: VRSN is the most stable of the top 5 most inaccurate firms; it contains no outliers. Data for the firm CHTR shows up only in 2017.

Prediction Error by Quarter

In [199]:
plt.figure(figsize = [20, 10])
sb.pointplot(data = diffs_top, x = diffs_top.term.dt.quarter, y = 'difference', hue = 'firm_ticker',errwidth = 0.5)

plt.legend(title = 'Firm Ticker')
plt.xlabel('Quarter')
plt.ylabel('Prediction Error')
plt.suptitle('Average EPS Prediction Error by Quarter', size = 20, y = .94)
plt.title('for the Top 5 Most Inaccurate Firms', size = 17)
plt.xticks(size = 15)

plt.savefig(PATH_MULTIVARIATE + 'features-perror-quarter-firm.png')
plt.show();

Observation 1: One most noticable outlier is AIG, which is, on average, the only outlier in Q3 of any given year.

Observation 2: The 5 most inaccurate firms by prediction error are AGN, AIG, CHTR, LRCX, and VRSN. The same list as generated through a yearly and term basis.

Observation 3: These 5 firms display relatively low prediction errors until Q4, where both LRCX and AIG "branch off" into opposite directions. This means that forecasters, on average, are likely to be more inaccurate in their EPS forecasts in Q4 of any given year.

Observation 4: VRSN is the most stable of the top 5 most inaccurate firms; it contains no outliers. Data for the firm CHTR shows up only in 2017.

Conclusions

Research Questions

Question 1: Does average EPS prediction error depict any differences in trends whether by a yearly, quarterly, or full-term basis?

  1. According to figure (features-act-fc-diffs-term), forecasters were most optimistic in 2008, Quarter 4, and most pessimistic in 2000, Quarter 4.

  2. According to figure (features-act-fc-diffs-year), forecasters were most pessimistic in the year 2000, and most optimistic in the year 2008.

  3. The year 2000 coincidentally shows one of the widest variances in average prediction errors (ranging from 0 to 0.5).

  4. The year 2008 also shows one of the widest variances in average prediction errors ranging from -0.25 until -1.5.

  • The two years where forecasters made overly pessimistic/optimistic predictions also coincidentally happened to contain the highest variance, and therefore, the forecasts are very spread out from the mean.
  1. According to figure (features-act-fc-diffs-quarter.png), the later the quarter, the more optimistic forecasters become in their predictions. This intuitively makes sense; as the familiarity of the year increases and patterns become established, the more I conjecture that forecasters would become confident in their EPS predictions.
  • However, figure (pct-firm-quarter) proves my intuition wrong because the most inaccurate predictions, on average, happen to be made during Quarter 4 in any given year. [POSSIBLY REMOVE THIS]
  1. The overall trend between the first two figures depicting average EPS prediction error by quarter and year is consistent. When ignoring the pessimistic and optimistic outliers, there is no slope; all average EPS forecasts gather around 0 by the year.

  2. It is only by a quarterly basis that we see a predictable pattern emerge, where outliers (usually optimistic) happen exclusively during Q4.

  3. When comparing the quarterly trends to the full-term trends, indeed, the 2 outliers occur solely in Quarter 4 of 2000 and 2008. Therefore, our yearly and quarterly figures display congruent trends & patterns with the figure depicting full-terms.

Question 2: I generate “dumb EPS forecasts” by calculating the rolling mean of actual EPS from the past 2 quarters. How do my forecasted EPS forecasts compare to Bloomberg forecasts?

  1. As depicted in figure (features-dumb-eps), my average predicted EPS more closely follows the average actual EPS trend instead of the forecasters'. This is a key takeaway: my method of using 2-quarter moving averages was much more effective in predicting avrage actual EPS values than the method used by Bloomberg forecasters.

  2. though only slightly more unstable. Variance among my predictions is much higher, as shown through the error bars per term.

  3. My personal predictions "spiked" and "troughed" in the terms 2000Q4 and 2008Q4—the exact same pattern as depicted through average forecasted EPS.

  4. Compared to the average forecasted EPS, my average predicted EPS is much more congruent with the trend of actual EPS, though only slightly more unstable. All of my predictions contain higher variance than actual EPS, which reduces my method's credibility when accounting for all individual data points.

  5. Bloomberg's forecasted EPS displays less variance, and is slightly more stable and reliable than my method of using rolling 2-quarter, or half-year, averages.

  6. The p-value generated between actual EPS and my forecasted EPS depicts a statistically significant relationship, with a p-value of 0.00.

  7. Similarly, the p-value generated between actual EPS and experts' forecasted EPS depicts a statistically significant relationship, with a p-value of 0.00.

Thus, these previous 2 results lead me to reject the null hypothesis in favor of the alternative that there is no difference between the means of actual EPS and both EPS forecast types: of my own, and of Bloomberg's. Therefore, regardless of my predictions containing higher variance, these results show that the 2 relationships are not explainable by chance alone. [source]

Question 3: What differences and similarities emerge when analyzing the prediction error and percentage error of EPS forecasts?

  1. For prediction errors, forecasters, on average, are likely to be more inaccurate in their predictions in Q4 of any given year. This is depicted in figure (features-perror-quarter-firm), where the top 5 most inaccurate firms by prediction error display relatively low average prediction errors until Q4, where both LRCX and AIG "branch off" into opposite directions on the x-axis.

  2. For percentage errors, EPS forecasts have become more inaccurate in the more recent terms, starting from 2014Q3. Outliers only started to appear in the trend depicted through (pct-tirm-term.png) starting in the year 2014. This conclusion is consistent with that observations made in the quarterly and yearly stripplots: (pct-firm-quarter) and (pct-firm-quarter-top-strip) for quarterly data, and (pct-firm-year) and (pct-firm-year-top-strip) for yearly data.

  3. The top 5 most inaccurate firms in terms of absolute prediction error are AGN, AIG, CHTR, LRCX, and VRSN. Respectively, these firm tickers stand for Allergan plc, American International Group Inc, Charter Communications Inc, Lam Research Corporation, and Verisign.

  4. The top 5 most inaccurate firms in terms of absolute percentage error are IBM, IRM, MCK, PXD, and QRVO. Respectively, these firm tickers stand for IBM Common Stock, Iron Mountain Inc, McKesson Corporation, Pioneer Natural Resources, and Qorvo Inc.

  5. For prediction error, the most notable outlier is AIG. This firm is, on average, the only outlier in Q3 of any given year.

  6. For percentage error, the most notable outlier is IRM, with a percentage error of over -1000 for EPS in the term 2014Q3. Outliers only started appearing starting in the year 2014

Question 4: Does statistical significance stay true among the relationships between actual EPS and forecasted EPS, regardless of forecasted type (raw data along with all yearly, quarterly, and twenty-year averages)?

  1. Short answer: Yes. Across the figures (avgs-act-fc-twenty, avgs-act-fc-quarter, avgs-act-fc-year, and features-act-fc-all.png), all p-values are recorded at 0.00, which is less than the Type I error threshold of 0.05.

  2. The figures all depict a moderately positively correlated relationship between actual EPS and forecasted EPS, regardless of type.

  3. Among the OLS models, all r-values are positive, with values varying between 0.3 and 0.8. Curiously enough, when using the raw forecasted EPS data instead of averages, a weak positive relationship emerges. This intuitively makes sense, as trends tend to become more unpredictable over longer spans of time. In effect, a longer, unpredictable time series would give rise to outliers that weaken the overall direction of the distribution.

  4. From x = 0 onward, the initial cluster of points for each figure gradually declusters, all while maintaining minimal distance from the regression line as residuals. Out of all the regression plots depicting actual vs. forecasted EPS, the raw data displays the least amount of variance This could potentially mean the raw data contains the lowest amount of bias as well.

Question 5: How do EOD Prices trend across all firms from 1999 - 2019?

  1. EOD prices show a general positive trend from 2009, as can be seen with figure (features-eod-term).

  2. EOD prices see a "trough" from 2007 - 2009: the exact years of the Great Recession.

  3. All patterns shown in the Actual vs. Forecasted pointplots are consistent with the stripplot (features-feature-value) at the beginning of the bivariate stage. Actual EPS has the most outliers, and there a few "spikes" in the pointplots where actual EPS significantly veers away from the forecasted EPS trend line.

FUTURE WORK

  • Acknowledge multicollinearity
  • Make more use of eps_fc_terms
  • Look at other forecasts made from different periods, not just 3 months before. This would give me a greater insight into forecasting patterns made at various points in time before the current fiscal period.
In [200]:
#convert notebook to HTML
from subprocess import call
call(['python', '-m', 'nbconvert', 'data_exploratory.ipynb'])
Out[200]:
0